Will AI Replace Evaluators?
I conduct qualitative research on myself and register some predictions
Note: In this essay, I’ve chosen to assume that the reader is not an expert in AI or computers. When I say things that might require a little extra explanation for an AI beginner, I have provided explanatory footnotes.
I am generally in favor of evaluators using AI. I already use AI to speed up everything I can, from having documents read aloud to me while I take screen breaks to double-checking my logic when I am using complex chains of quantitative reasoning. I’m highly motivated to automate everything I can because I have a lot of responsibilities that I regard as worth my time – staff to train and supervise, projects to manage, research papers to read – and finding ways to move quickly through the less important stuff gives me more time to think about the more important things. AI is now a critical part of my workflow and I wouldn’t want to go back.
But is the logical extension of this idea that AI will replace me? What about our whole profession?
I have decided that the best way to answer that question is to explain what I actually spend my time doing in a day and then ask “Which of these things could have been done with AI?”
As Arvind Narayanan, one of the authors of AI Snake Oil, has pointed out, AI doesn’t automate jobs; it automates tasks.
It stands to reason that, if your job is full of tasks that are easily automated by AI, your job is likely to be at least partly automated and you will probably see some downsizing in your sector. If your job is doing one kind of task over and over again, and AI can do that task almost as well as a human, well, you probably don’t have a job anymore.
To figure out what kind of job evaluators have, I submit the results of my qualitative observations on my work day.
An Average Morning in the Life of an Evaluator
Below are the first ten things I did on one randomly-chosen day last week, along with a rating of how AI-Replaceable my work was on a scale of 1 to 5 stars, with 5 being an excellent use case for AI. Afterwards, I provide an explanation for my ratings.
Gain access to confidential participant data for a program from an encrypted email with two-factor authentication. ⭐️
Set up custom gift-card rewards for interview participants and add funds to the account. ⭐️
Randomly select participants from a confidential database to take part in a study. ⭐️
Document detailed instructions on data collection for my team using project management software. ⭐️⭐️⭐️
Respond to emails letting people know I had received and reviewed information. ⭐️⭐️⭐️⭐️⭐️
Help a colleague debug code for an analysis. ⭐️⭐️⭐️⭐️⭐️
Pay an invoice for a large company expense in a division I oversee. ⭐️
Double-check a statistical analysis I ran for a needs assessment using publicly-available data. ⭐️⭐️⭐️⭐️⭐️
Meet with a potential business partner and discuss how we might collaborate on an open RFP, including sharing our past experiences in the area. ⭐️
Meet with an evaluation team on which I play an advisory role and mainly listen and ask a few questions while the team works together on a product for the stakeholders. ⭐️
Now let’s group these ratings in descending order.
⭐️⭐️⭐️⭐️⭐️: Five star ratings go to tasks that AI systems already excel at, such as having routine conversations, writing code, and doing data analysis. In actual fact, I don’t use AI to respond to email because my email is often confidential and I think it’s important to actually talk to people myself. I think talking to colleagues and stakeholders is not a waste of my time.
Funnily enough, the reason I got a request to help my colleague debug her code was that she had tried AI first and it failed to identify the error. I spotted the problem without AI, but ended up discussing the code with AI anyway to speed up unit testing, which probably saved me an hour.1
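Since I’ve assumed the reader may not be a computer person, here is a toy example of what a unit test looks like, written in Python in the pytest style. The function and the numbers are made up for illustration and have nothing to do with my colleague’s actual code.

```python
# A toy unit test (pytest style). The function under test is hypothetical.
def weighted_mean(values, weights):
    """Return the weighted average of `values`."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

def test_equal_weights_give_ordinary_mean():
    # Equal weights should reduce to the ordinary mean: (2 + 4) / 2 = 3
    assert weighted_mean([2, 4], [1, 1]) == 3

def test_unequal_weights_shift_the_result():
    # 10 counted three times as heavily as 0 -> (30 + 0) / 4 = 7.5
    assert weighted_mean([10, 0], [3, 1]) == 7.5
```

Each test checks one small, specific behavior, which is why there tend to be a lot of them and why having AI draft them is such a time-saver.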
As for the data analysis, I did use AI but encountered a performance issue. In my original analysis I ran a Bayesian multiple regression with nonlinear predictors without AI, but when I asked AI to check it, the analysis failed over and over again. Watching the Code Interpreter spit out Python code to try to do what I had just done in R in a handful of lines, then wipe out and hastily try again, was like watching Peter Sellers get his hand stuck in different objects while trying to interrogate murder suspects in The Pink Panther Strikes Again.
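For the curious, here is a rough sketch of what this kind of model can look like in code. It is written in Python with the PyMC library on simulated stand-in data; my actual analysis was done in R on confidential data, so every variable name and number below is illustrative only.

```python
# Illustrative only: a Bayesian regression with a nonlinear (squared) predictor,
# fit to simulated data. My real analysis used R and confidential data.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated outcome with a curved relationship to x1
y = 1.5 + 0.8 * x1 + 0.3 * x1**2 - 0.5 * x2 + rng.normal(scale=1.0, size=n)

with pm.Model():
    # Weakly informative priors on the coefficients
    intercept = pm.Normal("intercept", mu=0, sigma=5)
    b_x1 = pm.Normal("b_x1", mu=0, sigma=5)
    b_x1_sq = pm.Normal("b_x1_sq", mu=0, sigma=5)  # the nonlinear term
    b_x2 = pm.Normal("b_x2", mu=0, sigma=5)
    sigma = pm.HalfNormal("sigma", sigma=2)

    mu = intercept + b_x1 * x1 + b_x1_sq * x1**2 + b_x2 * x2
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)

    idata = pm.sample(1000, tune=1000, chains=4, random_seed=42)

print(az.summary(idata, var_names=["intercept", "b_x1", "b_x1_sq", "b_x2", "sigma"]))
```

This is only a handful of lines, which is part of why watching an AI flail at the equivalent was so memorable.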
⭐️⭐️⭐️: I scored one task with three stars – writing instructions for my team in project management software. I’m talking about making a numbered list of tasks to carry out in order to complete an objective. I didn’t use AI for this. If I had asked, AI would have happily made an imaginary list of tasks for the objective, which I then could have edited heavily to arrive at my actual list. However, for an AI to know what actually needed to go into that list would have required a huge amount of context. Writing task lists is in the AI wheelhouse, but the amount of tailoring needed put this one back in human hands.
⭐️: All my one-star ratings have to do with money, confidential data, or things that demand personal interaction. Allowing an AI to handle my company credit card is not a risk I am willing to take. One day, when the hallucinations are under control, I will consider it. (People who supposedly trust it to book their flights must have an entirely different subjective experience of air travel than I do – imagine the day it hallucinates that your name is spelled differently than it is on your passport.)
As for confidential data, I sign very scary documents prohibiting me from dumping certain information onto the internet. HIPAA, FERPA, NDAs, city and county and state contractor agreements, as well as my own respect for the privacy of program participants prohibit me from allowing people to be identified during or after an evaluation. I’m aware that I can disable the setting that allows for my data to be used for training purposes on most commercial AI, but this comes down to how much I trust companies like OpenAI and Anthropic.2 I trust Google more because their security team is world-class, but sharing certain confidential data with them would still breach some of my agreements. (More on Microsoft and security below.)
Finally, there were some distinct moments in my morning that demanded personal interaction. My teams spend most of their time working asynchronously, so when we get to see each other, it is an important opportunity for me to listen to them, identify blocks, and teach. I ended up only spending about an hour with my teams on this particular morning, and I wouldn’t have outsourced a minute of it to AI. There is simply too much interactional work going on there: we are doing things that humans are good at, at high efficiency. I work with smart people. The meeting with the potential collaborator was very important too. They were trying to decide things about me that AI (even one trained on a corpus of my work or my personality) couldn’t tell them. Am I going to be easy to work with or annoying? Do I stick to what I know or am I a bullshitter? How will I treat their clients? Not only did they want to hear what I said, they wanted to hear how I said it.
In summary, most of the tasks I did on my randomly-selected morning as an evaluator would not have been replaceable by an AI. I gave one-star ratings for replaceability to six of the ten tasks, a three-star rating to one, and five-star ratings to three. However, I only decided to use AI for two of the three five-star tasks, and of those two, AI wiped out on one (double-checking my statistical analysis). Right now, AI is not ready to fully replace me or any members of my team. But what about in a couple of years? There wouldn’t be much point to this essay if I only considered what AI can do right now. Let’s think about some changes that could occur that would make evaluators more replaceable.
Scenario #1: AI Gets Better at Handling Money and Confidential Information
While I am not personally ready to hand AI the keys to any financial account I control, one key indicator to watch will be how banks choose to adopt AI. A 2024 analysis by Citigroup concluded that about half of banking jobs are at risk for AI replacement. It also advanced the argument that this transformation would be massively profitable. If big banks are willing to hand control over to AI systems to manage money, this suggests that they have seen or are developing systems that they trust for this purpose.
A similar indicator of what is to come might be the fact that Anthropic has figured out how to make its AI models secure enough to be HIPAA compliant. This isn’t automatically available for all users, but the fact that Anthropic is willing to enter into a Business Associate Agreement (BAA) with a zero-retention policy for data adds legal teeth to their privacy protections.3 BAAs are required under HIPAA and are binding contracts governing the whole lifecycle of patient health information. This is the industry standard for confidential information and apparently Anthropic is ready to live up to it.
Another obvious answer to my confidentiality worries is locally-hosted AI. So far, these systems are not quite sophisticated enough to meet my requirements, but improvements are on the way. For an expected retail price of $3000, I will soon be able to buy an NVIDIA Grace Blackwell superchip that can run AI locally with 128 GB of memory and 4 TB of storage. It will be small enough to fit on my desk like an external hard drive. This AI computer would allow for completely secure interactions with data. By connecting such a system to the Network Attached Storage (NAS)4 I already have, I would be able to run AI analyses entirely independently of my main computer without any of the energy requirements of cloud computation or moving any information out of my office. The AI would be able to tick away in the background at any time of day, processing requests and storing results.
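To make the idea concrete, here is a rough sketch of what “processing requests and storing results” could look like, assuming the local AI box exposes an Ollama-style HTTP endpoint on my home network. The address, model name, prompt, and file paths below are hypothetical, not a description of any particular product.

```python
# Hypothetical sketch: send a prompt to a locally hosted model and write the
# result to network storage. Assumes an Ollama-style endpoint; all names are made up.
import json
import urllib.request

LOCAL_AI_URL = "http://192.168.1.50:11434/api/generate"   # hypothetical address of the AI box
NAS_PATH = "/mnt/nas/evaluation_project/summary.txt"       # hypothetical NAS mount point

payload = json.dumps({
    "model": "llama3",                                      # whatever model the box is running
    "prompt": "Summarize the attached interview notes in five bullet points.",
    "stream": False,
}).encode("utf-8")

request = urllib.request.Request(
    LOCAL_AI_URL, data=payload, headers={"Content-Type": "application/json"}
)

# Nothing here leaves the local network: the request goes to the AI machine,
# and the result is written straight back to the NAS.
with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())["response"]

with open(NAS_PATH, "w", encoding="utf-8") as f:
    f.write(result)
```

The point of the sketch is the data path: prompt in, result out, and at no step does anything touch a cloud service.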

If this scenario comes to pass, I will be able to use cloud-based services like the ones we already have to handle very sensitive data without fear of security breaches. Agreements between evaluation firms and AI companies will cover legal risks in the unlikely event of a hack. Alternatively, I will use a local AI computer to handle such data, configured to my personal specifications. In either case, evaluators will no longer be human gatekeepers of confidential information, and thus, that much more replaceable.
Scenario #2: AI Gets More Intelligent and Context-Aware
While the previous scenario focused on the fairly routine but sensitive aspects of being an evaluator, this scenario focuses on the higher-level critical thinking involved in the job. In the examples from my experiment, such tasks included running a sophisticated statistical model and creating detailed instructions for my team.
I anticipate that within a couple of years, the problems with the Python environment5 that caused the AI replication of my analysis to fail will be fixed, and it will be able to run Bayesian analyses like mine. It is already able to do some Bayesian statistics, just not everything I want it to do. A very likely scenario is that someone will develop very good statistics AI software that runs locally.6 Imagine a version of SPSS or Stata that accepts voice commands and converses with you about the kind of analytic choices you have to make, then executes the analysis and talks with you about the results. I am 99% convinced that such a software program is in development right now. Many statisticians fear developments like these (ask them about menu-based software or graphical modeling) because ease-of-use improvements persuade novices that they can do statistics too. How terrible.
Beyond the raw horsepower required to do sophisticated analysis, new AI systems could also change the game by becoming more context-aware. Writing that list of instructions in my example above would have been easy if the AI had full read access7 to everything in my project management software, the company cloud, and my local machine.
To decide whether this future is coming, I’m watching Apple Intelligence and Microsoft Copilot. It turns out that few people actually wanted to use Copilot due to a bumpy rollout and early security flaws. For example, Copilot has a problem with “oversharing” company information within the organization, internally leaking secret documents outside of need-to-know audiences. Microsoft’s focus on creating products for collaborating across entire organizations hit its zenith with Copilot, which hoovers up data from you and the CEO’s desktop alike. Apple has always played to the individual user, rather than to the organization, and Apple Intelligence is one more demonstration of this strategy. Most of Apple Intelligence is processed on-device, with some complex requests sent to Apple Private Cloud Compute, which is not visible to Apple. If you want the additional heft of ChatGPT, you can now connect it to Siri, which is also getting a full revamp. As usual, Apple took their time and made a product that people will actually want to use instead of shipping Temu garbage.
Something about ChatGPT only knowing what you tell it is very reassuring for most people. However, it turns out that we rarely tell AI enough to help us most effectively – one of the most common bits of advice for prompting is to give it a lot more context. The logical extension of this is letting it look at everything on your computer (except for areas you specifically exclude). An increase in the depth and breadth of AI inferences would mean fewer tasks that require an evaluator’s specific human attention. Many of the tasks that involve depth and breadth are the ones that we think of as core to our profession – sophisticated data analysis, seeing the forest for the trees, taking an unusually large amount of evidence into consideration before making a recommendation. AI getting better at these things would not only be job-threatening, it would be ego-threatening for many evaluators.
Scenario #3: Major Changes in Knowledge Work Culture
In knowledge work, staff time is the main driver of cost. Right now, we knowledge workers spend a lot of our time doing very human things. We have unnecessary meetings. We meet and greet. We buy each other coffee to hear each other’s opinions about a project. We circle back. We have a little office party sometimes.
Suppose that, instead of meeting with my teams, I sent an AI replica who mainly listened and asked a few helpful questions. Suppose that it had access to everything I was comfortable publicly disclosing and knew a lot about my mentorship goals for the people I’m training too. What if it never got tired of answering my team’s questions? What if it was faster at helping them with some tasks than me? What if it reported back everything important to me at the end of the day?
Suppose I sent this AI representative to meet the potential business collaborator in my example as well. What if they sent theirs too? What if our replicas reported back to each of us about the prospects for a working relationship before deciding whether to have the “real” meeting? In such a situation, I would see no point in concealing my real personality (e.g., high conscientiousness, low agreeableness) if it helped find a good match. What if my replica was always running around doing this with potential collaborators – a little business matchmaker?
Suppose that the norms around making mistakes were loosened so that we balanced considerations about decision speed with accuracy. When the AI made a mistake or hallucinated, everyone said “ah well, try again” and moved on instead of holding individual humans accountable for not catching every mistake. Suppose stakeholders started to feel the same way. To err is inhuman, to forgive is efficient.
Suppose it took a single work day to write a 100-page evaluation report and this was considered normal. What if everyone expected the report to be written by AI and was alright with this? What if they also expected to use their own AI to interact with the report, asking it questions and generating new analyses using linked data files? Imagine that these reports were better than the average report currently produced, but cost less and were delivered much faster from the moment they were requested.
I don’t particularly want to live in this hypothetical world, but then, I don’t want to live in the current one either. We don’t get to pick the world.
So, how will I know whether we are about to live in this hypothetical world? Early peer-reviewed research on creating AI replicas has shown that a 2-hour interview with subjects (n = 1052) is sufficient to create a replica that gave responses to new questions that were 85% similar to the ones given by the original person. I’ll be watching to see whether the concept takes off or it’s just too weird for most people. Next, consider that a pre-registered, peer-reviewed study (n = 758) showed that consultants with access to AI performed their work 25% faster and 40% better. According to an IBM survey, about 9 in 10 C-suite executives who purchase consulting services (n = 400) say that they’re actively looking for services that incorporate AI and technology assets and that they expect consultants to provide these services. In other words, expectations for consultants have already been reset by AI. I’m watching to see whether we start to value speed and timeliness as much as we value accuracy and accountability. If that happens, evaluators will be more replaceable because those of us with excellent AI systems will be able to be in several places at once and one evaluator will do the job that two or three used to do.
p(Replacement) for Evaluators
I believe that there is a high probability that many evaluation jobs will be lost in the AI revolution. I have already ordered the three scenarios, which are not mutually exclusive, in order of the probability I would assign to each. I think the probabilities are:
p(Scenario #1) = 99%
p(Scenario #2) = 95%
p(Scenario #3) = 30%
Since I think that these are all independent events, by my own logic the probability that at least one of them occurs is far in excess of 99%. I think any two of them together will definitely drive job loss, and the probability of at least two occurring is greater than 95%. This math depends on my initial probabilities being correct, of course, but it’s sobering.
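For anyone who wants to check my arithmetic, here is a small sketch that computes both figures from the probabilities above, under my independence assumption.

```python
# Check the "at least one" and "at least two" figures, assuming the three
# scenarios are independent with the probabilities I assigned above.
from itertools import product

probs = [0.99, 0.95, 0.30]  # p(Scenario #1), p(Scenario #2), p(Scenario #3)

def prob_at_least(k, probs):
    """Probability that at least k of the independent events occur."""
    total = 0.0
    for outcome in product([True, False], repeat=len(probs)):
        if sum(outcome) < k:
            continue
        weight = 1.0
        for happened, p in zip(outcome, probs):
            weight *= p if happened else (1 - p)
        total += weight
    return total

print(prob_at_least(1, probs))  # about 0.9997 -- "far in excess of 99%"
print(prob_at_least(2, probs))  # about 0.958  -- "greater than 95%"
```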
While it’s true that, as I said at the beginning of the essay, AI automates tasks not jobs, if a large proportion of your job becomes automated, there will be a market pressure to reduce the hours for which you are paid. Your competitors will propose to do the job faster than you for lower pay because their costs are lower. Over time, this drives firms to combine multiple jobs into one. It drives firms that fail to do this out of business. This argument assumes efficient and competitive market conditions where cost-cutting is a dominant driver. In practice, there may be countervailing forces – more on this in a moment.
What I don’t have any guesses about is the proportion of jobs that will be lost in each scenario. I can only make some claims about the market factors, independent of technological development, that will drive the loss:
Stakeholder demand for AI use
Stakeholder demand for lower-cost evaluation
Stakeholder demand for faster evaluation
Speed of AI adoption among critical mass of evaluators
Efficiency gains of evaluators who adopt AI
Let’s say that demand for AI is high, that stakeholders want lower costs due to a sectoral collapse of prices (in knowledge work), and that stakeholders want faster evaluations due to sectoral shifts in expectations. This primes the evaluation market for a race to adopt AI as quickly as possible and realize the biggest efficiency gains they can. A few winners emerge and become price-setters. Evaluation jobs are lost and enshittification ensues. In one version of this reality, humans compete against nonhumans for evaluation contracts, which are executed entirely by powerful AI with no human involvement (I personally put the probability of this at 20% by 2040). In this future there is a high level of replacement.
On the other hand, let’s say that stakeholders react negatively or neutrally to AI. By disapproving or ignoring its use, they may set up conditions in which lower-cost and faster evaluations signal poor quality. In this case, the market pressures to adopt AI are lessened. Efficiency gains still go to evaluators who use AI to increase personal productivity, but these gains do not need to be as extreme in order to compete. If the market goes in this direction, then the pressure to replace knowledge work with AI systems will not impose as many costs on evaluators. For example, the kinds of practices envisaged in Scenario #3 impose costs on the evaluator by making them less knowledgeable about what is going on in their projects and teams, which has perhaps the same functional impact as being sick and missing a week of work. In other words, Scenario #3 involves imposing technical debt on oneself. And, of course, there are the direct costs of adopting sophisticated AI systems, which can be very expensive. Lower demand for AI among stakeholders may create the market conditions that bring about the best of both worlds, in which evaluators get the efficiency and quality gains of a new technology without the race to the bottom on quality and price. Some of these efficiencies can be passed on to stakeholders and some can be retained by evaluators – sharing the benefits of new technology. In this future there is low replacement and high augmentation. The profession may even grow as evaluation costs fall somewhat and quality improves.
If there is one thing that I think might shield evaluation from what seems to me to be the likely fate of a lot of the knowledge economy, it is this: evaluators are entrusted with helping public institutions like government agencies and nonprofits to understand themselves and make the right decisions. While many people are cavalier about passing marketing or film production to AI if it saves a bit of cash, very few responsible adults are ready to put our core public institutions on autopilot. We want humans (maybe we want them to be assisted by smart computers too) – but we definitely want humans – determining the merit and worth of society’s collective projects. As long as evaluation is considered part of the governance of institutions, rather than just another kind of service expense, the market may not demand rapid AI adoption.
To conclude, p(Replacement) looks high right now for many evaluators. Technological improvements seem virtually certain, which will probably lead to productivity gains and less demand for evaluation labor. However, what seems far more uncertain are the stakeholder expectations that will drive market conditions. If stakeholders demand AI use and expect all efficiencies to be passed on to them, some very strange things will happen to the profession, likely including major job loss. Perhaps evaluation’s special status as a necessary decision-making tool for public institutions will give human work special status in the minds of stakeholders. Right now, I am convinced that, more than anything else, the future of our profession depends on what they think about us.
“Unit testing” refers to checking code at the smallest levels of functionality, such as line by line. It takes a while and AI is good at helping with it. The fact that AI saved me an hour doing this means that my company AI subscription paid for itself for the entire month.
OpenAI makes ChatGPT and has gone from being very transparent to very opaque. Anthropic is another AI company that makes Claude. It is a bit more transparent and is a public benefit corporation. The point of my remark here is that both companies are relatively young and probably do not have the security chops of, say, Google or Apple.
Zero-retention policies mean that companies don’t keep copies of your data beyond the time needed to complete the task you requested. This is a good privacy expectation for evaluators to insist upon when using AI products.
NAS is like an external hard drive that is not connected to my computer with a cable. Instead, it talks to my local network and lets me store files using local wifi. I’m proposing hooking this up to the NVIDIA AI computer.
When AI runs statistical analysis in the programming language Python, it creates a “virtual environment” to do this. This virtual environment is basically a container that makes sure your code can run no matter how the rest of your system is set up.
Systems are said to run “locally” when they do not require a connection to the internet and can process entirely on hardware that is at your location.
“Read access” refers to the permission to see, but not edit, files.
“People are resistant or unwilling to provide enough context to allow us to help them effectively” is the reality of therapists, doctors, evaluators, friends, coworkers, spouses… and so many others. One of the great advantages of AI that I anticipate is its ability to let people sidestep social taboos around being vulnerable and to force people to actually learn how to ask for help.
And asking for help effectively is a skill! Talking to a computer—programming, chatGPTing or otherwise—is a great way to come face to face with one’s own flawed assumptions, faulty logic, and communication gaps.