The field of artificial intelligence evaluation bloomed overnight. Before the LLM explosion in 2022, few people were familiar with the idea that AI systems were being systematically evaluated. Most of the evaluations the public did see were presented as flashy marketing stunts. Deep Blue’s win against Garry Kasparov in 1997 and AlphaGo's victory over Lee Sedol in 2016 were milestones of this period. Such feats are legitimate evaluations in my view (in the same way that the Olympic games are evaluations), but the rhetorical framing was of the John Henry, man versus machine, variety.
Only in the last couple of years have regular people (non-nerds) become aware that AI systems are tested against benchmarks for performance and that it really matters whether a system can hit the same benchmarks at lower cost in a head-to-head competition. AI evaluations now lead the headlines when a new model is released. In a way, evaluation has never had it this good. Most of the time, the media just ignores us.1
And yet, these evaluations are, like the models themselves, rapidly produced, cutting-edge creations. Many of them are quite sophisticated. However, so far, I see little connection to evaluation theory as a whole. I think it’s past time to invite them into the fold. But if evaluation were a lunch room, what table would they sit at? Product evaluation? Educational assessment? Something stranger? Let’s walk around the cafeteria.

AI Evaluation as Product Evaluation
Many AI systems are commercial products. In that sense we can evaluate them in the same way we evaluate a new car or website. User experience is the most important factor here. To understand how product evaluators work, one could use Tomer Sharon’s organizing questions from Validating Product Ideas Through Lean User Research (2016):
What do people need?
Who are the users?
How do people currently solve a problem?
What is the user’s workflow?
Do people want the product?
Can people use the product?
Which design generates better results? (etc.)
Indeed, the teams behind a lot of AI products should have thought about these basic questions more. For example, why does everyone like Perplexity but hate the Google Gemini results at the top of Google search? That’s something a product evaluation could have figured out.
The major limitation of treating AI systems as commercial products is that most existing AI models are not monetized at all – in fact, a million of them are free to download right now.
Another limitation is that many future AI systems will likely interact through APIs rather than user interfaces, making direct human evaluation difficult. User interfaces will be easily swappable depending on the purpose of the software. You will use the same AI system across different products but not realize you are talking to the same model. The same “product” will actually be a dynamic set of specialized models governed by an executive function agent model (agent selection mechanisms, ASMs). Our understanding of what we are evaluating is going to get slippery if we stay within the “product” frame.
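To make that concrete, here is a toy sketch of what an agent selection mechanism might look like. The model names and the deliberately crude routing rule are my own invented stand-ins, not anyone’s actual product: the point is only that the user talks to one “product” while different specialized models answer behind the scenes.

```python
# Hypothetical sketch of an "agent selection mechanism" (ASM): one executive
# component routing user requests to specialized models behind a single product.
# The specialist names and routing heuristic are invented for illustration.

from typing import Callable, Dict

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "code":    lambda prompt: f"[code model] {prompt}",
    "math":    lambda prompt: f"[math model] {prompt}",
    "general": lambda prompt: f"[general model] {prompt}",
}

def select_agent(prompt: str) -> str:
    """Crude keyword router standing in for the executive agent model."""
    if "def " in prompt or "bug" in prompt:
        return "code"
    if any(ch.isdigit() for ch in prompt):
        return "math"
    return "general"

def respond(prompt: str) -> str:
    agent = select_agent(prompt)
    return SPECIALISTS[agent](prompt)

print(respond("Fix this bug in my parser"))  # routed to the code specialist
print(respond("What is 17 * 23?"))           # routed to the math specialist
```

If the “product” we evaluate is really this router plus whatever sits behind it on a given day, a fixed product evaluation has a moving target.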
AI Evaluation as Educational Assessment
To discover how “intelligent” artificial intelligence really is, we’ve been throwing a whole lot of educational tests at it, including the classics: the SAT, the LSAT, the MCAT, and IQ tests. More practical tests, like context-awareness and coding tests, have been created as well. These tests serve the same purpose for machines as they do for humans – to measure latent traits that we have collectively decided are important. But what is the evaluation thinking behind this assessment? So far, I’m not seeing much. A more principled approach to the educational assessment of AI would be something like Bob Mislevy’s evidence-centered design2:
Domain analysis: figuring out what knowledge, skills, and abilities we are assessing
Domain modeling: what it looks like when learners are in the process of acquiring that knowledge and those skills and abilities
Conceptual assessment framework: structures in which student work is presented and the criteria we will use to evaluate its quality
Assessment implementation: the optimal way to structure the assessment and how much data we need from it
Assessment delivery: how different parts of the assessment are weighted and how scores are communicated
Thinking about AI evaluation as an educational assessment means more than just making it sit for the SAT like the rest of us. After all, I’m not convinced that these tests are even valid for AI, since AI training data is partly contaminated with old versions of these tests. The tests were made for humans and humans do not have this problem.3 This means that AI not only knows the answers to all the MCAT questions that have ever been publicly released, it’s also got a lot of spooky tacit knowledge about how the Association of American Medical Colleges likes to write test items. For example, do MCAT authors have a slight tendency to prefer “all of the above” items for harder topics? An AI could learn this by accident, use the knowledge, and we might never know.
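To make the contamination worry concrete, here is a minimal sketch of one common style of check: measure how much of each benchmark item’s text already appears, n-gram by n-gram, in the training corpus. The function names, n-gram length, and flagging threshold are illustrative choices on my part, not a standard from any particular benchmark.

```python
# Minimal sketch of an n-gram contamination check (illustrative only).
# Assumes training documents and benchmark items are available as strings.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_report(benchmark_items, training_docs, n: int = 8, threshold: float = 0.3):
    """Flag benchmark items whose n-grams overlap heavily with the training corpus."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    flagged = []
    for item in benchmark_items:
        item_grams = ngrams(item, n)
        if not item_grams:
            continue
        overlap = len(item_grams & train_grams) / len(item_grams)
        if overlap >= threshold:  # arbitrary cutoff for this sketch
            flagged.append((item, round(overlap, 2)))
    return flagged

# Toy usage with invented text:
docs = ["a patient presents with crushing chest pain radiating to the left arm and jaw"]
items = ["A patient presents with crushing chest pain radiating to the left arm and jaw. What is the next step?"]
print(contamination_report(items, docs, n=5, threshold=0.3))
```

A check like this catches verbatim leakage; the spookier, statistical kind of contamination described above is much harder to detect.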
Starting with an evidence-centered design approach, we could create a version of a medical exam that is valid for AI. For this reason, I trust benchmarks designed specifically for AI more than I trust familiar human tests. However, we still have a long way to go before AI evaluations embrace the basic principles of psychometrics and we can say that they are doing educational assessment. Right now, they don’t even report error estimates for test performance – they just report the highest or mean score as the benchmark result.
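Reporting error alongside a benchmark score is not hard. Here is a minimal sketch, with made-up per-item results, of attaching a bootstrap confidence interval to a benchmark accuracy rather than publishing a single point estimate.

```python
import random

def bootstrap_accuracy_ci(item_correct, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for benchmark accuracy from per-item scores."""
    rng = random.Random(seed)
    n = len(item_correct)
    boot_means = []
    for _ in range(n_boot):
        resample = [item_correct[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(resample) / n)
    boot_means.sort()
    low = boot_means[int((alpha / 2) * n_boot)]
    high = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(item_correct) / n, (low, high)

# Hypothetical per-item results (1 = correct, 0 = incorrect) for a 200-item benchmark.
scores = [1] * 152 + [0] * 48
point, (low, high) = bootstrap_accuracy_ci(scores)
print(f"accuracy = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

On a 200-item benchmark, the interval around that 76% is wide enough that many headline “model A beats model B by two points” claims would dissolve into noise.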
AI Evaluation as Program Evaluation
As we implement AI in organizational environments, we can treat it like an intervention. For example, BCG studies have compared the productivity and accuracy of consultants who used AI to consultants who didn’t. More importantly, there is a popular AI evaluation system in use for commercial banks, the Evident AI Index, that I would argue treats AI as a program rather than a product. Evident AI’s most recent annual study looked at criteria like transparency in AI adoption and the proportion of staff dedicated to working on AI. These are classic process measures for program implementation and look nothing like a technical benchmark such as HellaSwag-Pro. The focus is on how organizations – in this case banks – are implementing AI as a system.
Looking at AI through a program evaluation lens has some obvious benefits. Anyone who has read anything in the sociology of technology (e.g., actor-network theory) knows that looking at technology in a vacuum leads to silly predictions about how it will actually be used, how effective it will be, adoption rates, and so on. For example, those always-on audio recording AI devices… are illegal to use in much of the world due to consent laws. Broadening the frame beyond the product or its features (like its intelligence) is going to give us a much better idea of how to improve AI. Program evaluation is big-picture thinking by design. It’s no coincidence that program evaluation has developed or adopted some intimidatingly complex methodologies, like VfM assessment and Data Envelopment Analysis – it has to find ways to handle huge amounts of heterogeneous data with high stakes for getting it wrong. This is exactly the level of seriousness with which we need to approach AI.
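For readers who have never met Data Envelopment Analysis, here is a bare-bones sketch of the textbook input-oriented CCR model solved as a linear program. The program data are invented and the code is only meant to show the flavor of the method, not how any real evaluation would set it up.

```python
import numpy as np
from scipy.optimize import linprog

def dea_efficiency(inputs, outputs):
    """Input-oriented CCR DEA efficiency scores for each decision-making unit (DMU).

    inputs:  (n_dmus, n_inputs) array, e.g. budget, staff time
    outputs: (n_dmus, n_outputs) array, e.g. clients served
    """
    X, Y = np.asarray(inputs, float), np.asarray(outputs, float)
    n = X.shape[0]
    scores = []
    for o in range(n):
        # Decision variables: [theta, lambda_1, ..., lambda_n]
        c = np.zeros(n + 1)
        c[0] = 1.0  # minimize theta (the proportional input contraction)
        # Input constraints: sum_j lambda_j * x_ij <= theta * x_io
        A_in = np.hstack([-X[o].reshape(-1, 1), X.T])
        b_in = np.zeros(X.shape[1])
        # Output constraints: sum_j lambda_j * y_rj >= y_ro
        A_out = np.hstack([np.zeros((Y.shape[1], 1)), -Y.T])
        b_out = -Y[o]
        res = linprog(c,
                      A_ub=np.vstack([A_in, A_out]),
                      b_ub=np.concatenate([b_in, b_out]),
                      bounds=[(None, None)] + [(0, None)] * n,
                      method="highs")
        scores.append(round(res.x[0], 3))
    return scores

# Invented example: three programs, two inputs (budget, staff) and one output (clients served).
print(dea_efficiency([[100, 5], [120, 4], [150, 8]], [[300], [330], [360]]))
```

A unit that scores 1.0 sits on the efficiency frontier; anything below 1.0 is, relative to its peers, getting less output for its inputs. That is the kind of multi-input, multi-output comparison AI adoption studies will eventually need.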
The limitations of looking at AI adoption as program evaluation come from the problems with program evaluation as a field right now. If you read this newsletter, you have a good idea of what I think those are, but to summarize: theory from the ’80s + statistics from the ’50s + a wave of self-loathing anti-intellectualism among the educated that would have rendered Richard Hofstadter inarticulate. As a field, program evaluation is not presently ready for the challenges that confront it. I believe we can get there, but we have a lot of work to do.
AI Evaluation as Personnel Evaluation
The oldest AI evaluation metric in the book – the Turing Test – is the one that treats AI like a person. As AI starts to stand in for humans, we will begin to evaluate it according to human standards of performance.
I recently heard someone joke that replacing certain problematic people with AI might be a relief in their company because they won’t have to have as many emergency HR meetings. I hate to be the bearer of this annoying news, but yes, you will absolutely be holding HR meetings about AI behavior. Remember Microsoft’s first attempt at releasing an AI chatbot on Twitter in 2016? She was named “Tay” and was supposed to talk like a young American woman. Tay didn’t last a full 24 hours before she came down with an acute case of racism and Microsoft did the responsible thing and pulled the plug. (It’s a really good thing they didn’t call that one Cortana.) If AI can cause harm and people treat it like personnel, we’re going to want to evaluate it like personnel.
This raises a very interesting question that I don’t hear a lot of people ask about AI that will supposedly replace human labor: will we lower our standards for the way that labor is performed as the cost of that labor drops? I’m sure we all know someone (or are someone) who has been fired for something they said at work: “People are hired for their skills but fired for their behavior.” What about AI? When a cheap AI system for customer service repeatedly ignores or insults customers, will it be 'fired'? Will cost-efficiency outweigh competence? How long will we keep playing the “I can fix it” game? This is a personnel evaluation question.
Another interesting question in the personnel evaluation of AI will be how to apply the criteria we use for selecting humans for work to AI systems. When I interview candidates for positions, I’m evaluating them on a large multidimensional set of criteria which I’ve formalized into a personal rubric for consistency. I haven’t yet spoken to an AI that would pass my interview. The personnel evaluation question is: should I use the same rubric? On the one hand, perhaps fairness demands it. On the other hand, it’s a different species!
Personnel evaluation has problems similar to those of program evaluation, and they are arguably worse. Traditional personnel evaluations rely heavily on periodic reviews by untrained supervisors, which are unreliable and infrequent. (These organizations reward the work done by staff who have conscientious supervisors, not staff who perform well.) In contrast, more contemporary approaches emphasize continuous data collection on actual job performance. Organizations that use a 21st-century approach to personnel evaluation collect data directly on performance, use portfolios of work to evaluate quality, and do 360-degree evaluations to include the perspectives of everyone with whom the person works, among other techniques. When we treat AI evaluation as personnel evaluation, we need to ask which version of personnel evaluation we are talking about – antiquity or modernity.
Have a Seat
In her address to the American Evaluation Association in Portland in 2024, Dr. Bagele Chilisa said that for too long, program evaluation has been too focused on the human to the exclusion of the non-human. There are many reasons to agree with this. As Bruno Latour compellingly argued throughout his career, social science training has caused many of us to see all social phenomena as essentially human, with non-humans reduced to inconsequential props being moved around on a stage. The truth, troubling to those who believe in radical social constructivism, is that non-humans have forms of autonomy and agency as well. This truth is only becoming more obvious with the rise of AI, since the nonhuman, created world is increasingly creative and assertive. The era of human-centric social ontology is dead. The ability of program evaluation to bring AI evaluation into the fold will be a test of how deeply we can integrate that fact into our basic understanding of what we do.
I want to invite AI evaluators to join the American Evaluation Association, come to our conference this year, and see if there is anything that they like. What is missing from this post, of course, is the knowledge that an AI evaluator would bring to our field. Obviously, AI evaluation doesn’t fit neatly into product, program, or personnel evaluation, or educational assessment. Program evaluation is the most holistic of these frameworks, however, and what AI evaluation has learned could help to improve it considerably. Please have a seat and have your say.
1. For example, in 1997, the Minnesota Institute of Public Health surveyed school officials, parents, police, and students across the state. Of those surveyed, 88% agreed: “Even if there is no scientific evidence that DARE works, I would still support it.” (O’Connor, 1997, p. A5). For discussion, see: Patton, M. Q. (2012). Essentials of utilization-focused evaluation. Sage Publications.
2. This is a very rapid, partial walkthrough. If you’re reading this – sorry, Bob. I just want to give readers a quick tour. Everyone else, I suggest you go read Bob’s book about educational assessment.
3. A human can take practice tests, but no human can take every practice test ever released, remember all the answers, and run statistical analyses on patterns in the way the questions are asked. In psychometrics, we start to wonder about validity when people study too hard for things like IQ tests.