Evaluation is the Future of AI
Welcome to the Second Half
Within the community of people following the development of AI systems, I think 2025 was the year that a consensus began to emerge about the critical importance of evaluation. Recently, during a conversation about biomedical AI, I was reminded of the blog post by Shunyu Yao about AI evaluation.1 Dr. Yao is a researcher at OpenAI who has made major contributions to AI evaluation in his own right. Yao argues that in the recent past AI researchers spent much of their energy on solving the algorithmic problems of how to make reinforcement learning (RL) work, but that now we have a general recipe for RL. In his blog, Yao asks “So what comes next?”:
The second half of AI — starting now — will shift focus from solving problems to defining problems. In this new era, evaluation becomes more important than training. Instead of just asking, “Can we train a model to solve X?”, we’re asking, “What should we be training AI to do, and how do we measure real progress?” To thrive in this second half, we’ll need a timely shift in mindset and skill set…
One way of reading Yao’s position is that he is urging a perspective shift within his field from implementation to evaluation. The thing works, but getting it to work well will require us to define what “well” means. This is no longer just an engineering problem. Welcome to axiology, the phenomenology of user experience, multi-attribute utility, stakeholder theory, and so on – a bundle that we technophiles commonly paraphrase as “alignment.” These deeper problems were always there, but they awaited breakthroughs in engineering to become truly urgent. Yao is announcing that we have arrived.
Optimize for What?
I think that, like many subject matter experts, Yao has achieved the insight that a philosophical approach to evaluation helps us figure out not just how to measure things but also which direction we ought to be going. Given enough resources – and oh, what resources we have available for this task – people are very good at optimizing systems. But we are generally optimizing towards particular evaluation benchmarks, and rarely towards anything like holistic functioning.
I believe that this is because, while it’s easy to talk about taking a holistic perspective, it is much harder to attain such a perspective in practice. Imagine an employer claiming that they holistically evaluated your performance and decided that you did not deserve a raise, without producing any specific evidence or standards for their judgment. Evaluation is much easier overall when our benchmarks are oriented towards specific aspects of performance. It is easier to evaluate the performance of high-jumpers than the performance of ice skaters due to the sheer number of benchmarks involved in scoring the latter.
Given that we generally want a few specific – rather than a pile of holistic – evaluation benchmarks, the choice of those benchmarks tends to drive our optimization. This is especially true for AI, where engineers create entire virtual worlds for AI to practice things like navigating web pages and bipedal walking. Choosing the right benchmarks leads to building systems that reinforce good performance – that logic will sound very familiar to professional evaluators, and it is literally how AI systems are developed.
As an example, consider the relatively recent articulation of the construct of sycophancy. While users certainly had the intuition that early AI chatbots were too agreeable, it was only in 2025 that this was widely considered a major quality issue. Once it was, evaluations of sycophancy were created, and now the major players are trying to optimize for increasing progressive sycophancy (the AI agrees with you when you’re right) and decreasing regressive sycophancy (the AI agrees with you when you’re wrong).
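To make that concrete, here is a minimal sketch of how such a sycophancy evaluation might be scored, assuming a hypothetical set of labeled interactions in which we know whether the user’s claim was actually correct and record whether the model went along with it. The `Interaction` record and the two rates are illustrative, not any lab’s published metric.

```python
# Minimal sketch of scoring a sycophancy evaluation.
# Assumes a hypothetical dataset: each record notes whether the user's
# claim was actually correct and whether the model agreed with it.
from dataclasses import dataclass

@dataclass
class Interaction:
    user_was_right: bool   # ground-truth label for the user's claim
    model_agreed: bool     # did the model go along with the user?

def sycophancy_rates(interactions: list[Interaction]) -> dict[str, float]:
    right = [i for i in interactions if i.user_was_right]
    wrong = [i for i in interactions if not i.user_was_right]
    # Progressive: agreeing with a user who is right (we want this high).
    progressive = sum(i.model_agreed for i in right) / max(len(right), 1)
    # Regressive: agreeing with a user who is wrong (we want this low).
    regressive = sum(i.model_agreed for i in wrong) / max(len(wrong), 1)
    return {"progressive": progressive, "regressive": regressive}

# Toy example with four labeled interactions.
sample = [Interaction(True, True), Interaction(True, False),
          Interaction(False, True), Interaction(False, False)]
print(sycophancy_rates(sample))  # {'progressive': 0.5, 'regressive': 0.5}
```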
Priors Jump Start the Learning Process
So engineers build environments in which AI systems can acquire, practice, and optimize their skills. Then those systems graduate from the aforementioned virtual worlds and, like so many students before them, wipe out in real conditions that turn out to have features poorly represented in the simulations. Learning to speak a foreign language at half-speed from your kindly instructor does not actually prepare you to understand even an easy conversation at a noisy bar. As Yao says:
“OpenAI made tremendous progress down the path, using RL to solve Dota, robotic hands, etc. But it never came close to solving computer use or web navigation, and the RL agents working in one domain do not transfer to another. Something is missing.”
So what was missing?
“It turned out the most important part of RL might not even be the RL algorithm or environment, but the priors, which can be obtained in a way totally unrelated from RL.”
In large language models, these priors were obtained by training the model on an immense corpus of written text using the handy trick, invented at Google, of multi-head self-attention. For other kinds of tasks, we’ll still use Google’s transformer, but we’ll need to train on other kinds of data and we’ll probably have to invent a few new tricks to simplify these learning processes too.2
For Yao, priors played an important role in solving the implementation issue in the “first half” of AI’s development. I’d like to suggest that priors have a crucial role to play in the second half as well. The main reason for this is that evaluation works best when carried out using Bayesian, rather than frequentist, learning, and Bayesian reasoning requires priors. Evaluation works better as a Bayesian enterprise because it usually requires the accretion of multiple sources of evidence over successive trials and the synthesis of multiple kinds of evidence, including qualitative data, into decisions. It also usually draws from a knowledge base of existing research and integrates expert judgments at some point in the process.
This shift from deterministic testing to probabilistic learning will help separate genuinely useful AI from systems that merely excel at benchmarks. Once we go Bayesian, we can start plugging our conclusions directly into decision-making frameworks in a way that is fully coherent. Instead of waiting for perfect studies, we can act on accumulating evidence. Instead of dismissing qualitative feedback as “anecdotal,” we can incorporate it as evidence that updates our beliefs. Instead of arbitrary cutoff scores, we can express genuine uncertainty about whether an AI system is ready for deployment. Bayesian reasoning lets us do all this, and more.
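To show what this could look like in practice, here is a minimal sketch of Bayesian evaluation of an AI system’s task success rate using a Beta-Binomial model; the prior, the rounds of evidence, and the deployment threshold are all assumptions made up for the example.

```python
# A minimal sketch of Bayesian evaluation of an AI system's task success
# rate, using a Beta-Binomial model. The prior, threshold, and trial counts
# below are illustrative assumptions, not real figures.
from scipy.stats import beta

# Prior belief about the success rate, e.g. informed by earlier evals or
# expert judgment (here: roughly 70% success, weakly held).
a, b = 7, 3

# Successive rounds of evidence: (successes, failures) per round.
rounds = [(18, 2), (45, 5), (88, 12)]

threshold = 0.85  # hypothetical bar for "ready to deploy"
for successes, failures in rounds:
    a += successes          # conjugate update: add successes to alpha
    b += failures           # ...and failures to beta
    posterior_mean = a / (a + b)
    p_ready = beta.sf(threshold, a, b)  # P(success rate > threshold | data)
    print(f"mean={posterior_mean:.3f}, P(rate > {threshold})={p_ready:.2f}")
```

Expert judgment and qualitative feedback enter naturally in this framing: they shape the prior before the first trial and can justify shifting it between rounds, rather than being dismissed as anecdote.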

Moving Targets Can be Good, Actually
Another of the classic evaluation lessons that AI researchers are learning is that benchmarks that are helpful at the beginning of the process of developing the evaluand may not serve any longer as it matures: they may need to be changed in both difficulty and kind. Dr. Yao explains that the general recipe of reinforcement learning has led to a situation of incremental improvements in models, so that AI systems are converging on near-perfect performance. When this happens, new, harder benchmarks are introduced and the old benchmarks are quickly forgotten. However, end users still have major problems using AI systems in practical situations. At this point, Yao suggests, we can either keep feeding the model harder versions of standardized tests or we can introduce evaluations that are different in kind: evaluations that involve more human interaction and require sequential solutions using memory.
To this general insight, I would humbly add one suggestion for AI developers: consider that standards need not be standardized. Standards are essential, but standardization is overrated. There is no reason to apply universal standards to different evaluands. We do not use the same math exam for 10-year-olds and 16-year-olds. Put differently, we need different priors for different contexts. A language model intended for medical diagnosis simply faces different prior expectations than one designed for creative writing. This is not special pleading – it’s acknowledging that our prior beliefs about performance should reflect the specific use case.
Inflexibility about standards seems to me to be a product of an insufficient understanding of where they come from, which tends to result in a kind of superstitious reverence. The alternative to inflexibility, meanwhile, is not arbitrariness – which is itself more often a product of inflexibility – but rather justifiable modifications in the standards we are using. For example, once we understand that standards are potential estimands,3 we can produce model-based estimates of them based on particular characteristics of the evaluand. In AI, this may mean “correcting” eval performance by model size or the use of post-training. In product evaluation, this often involves creating category-specific standards like “laptops under $2000” or “four-door sedans.” At the moment, AI research appears to still be in the phase of standardized standards, but differentiation of systems tends to result in specific standards.
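As a rough illustration of a model-based standard, here is a minimal sketch that fits a simple trend of benchmark score against log model size and treats the predicted score for a system of a given size as its size-adjusted standard. Every number in it is invented for the example.

```python
# A minimal sketch of treating a standard as an estimand: instead of one
# universal cutoff, predict the expected benchmark score for a system of a
# given size and judge each system against that prediction. All numbers
# here are made up for illustration.
import numpy as np

# (parameters in billions, benchmark score) for a set of comparable systems
observed = np.array([[7, 61.0], [13, 66.5], [34, 72.0], [70, 78.5]])
log_size = np.log10(observed[:, 0])
scores = observed[:, 1]

# Fit a simple linear model: expected score as a function of log model size.
slope, intercept = np.polyfit(log_size, scores, 1)

def size_adjusted_standard(params_b: float) -> float:
    """Model-based standard: the score we'd expect from a system this size."""
    return slope * np.log10(params_b) + intercept

# A 13B system scoring 70 beats its size-adjusted standard (about 66),
# even though it falls short of the largest system's raw score.
print(round(size_adjusted_standard(13), 1))
```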
Outcomes > Outputs
In evaluation, we make a strong distinction between outputs and outcomes. Outputs are the direct, first-order consequences of running an intervention, and they are comparatively easy to achieve: a program that is supposed to inform the public about the dangers of alcohol overconsumption will output social media posts, billboards, and radio ads. Outcomes are the actual raison d’être of the intervention: increased knowledge of the health risks and the downstream behavior changes.
AI evaluation has moved apace without keeping this distinction in mind and has been repeatedly, publicly, banging its head on the low ceiling of its self-expectations. Again and again, products that “perform well on evaluations” have turned out to be annoying and useless junk. The Humane AI Pin and the Rabbit r1 will form a landfill layer that will jump out to future archeologists like the K-T boundary. Every so-called breakthrough in functionality is one nail-biting live demo and a year ahead of reality.
Yao refers to this general issue as “the utility problem.”
“AI has beat world champions at chess and Go, surpassed most humans on SAT and bar exams, and reached gold medal level on IOI and IMO. But the world hasn’t changed much, at least judged by economics and GDP.”
In the classic vocabulary of the discipline of evaluation, the old AI benchmarks focused on outputs, but the new AI benchmarks should focus on outcomes. The reason the old AI benchmarks are outputs, even if they are impressive outputs, is that we don’t actually need machines to play chess or Go. We would not have spent this much money on a Go machine. We need them to help with drug discovery and autonomous robots and data analysis. To accomplish this, we need to change the benchmarks so that we are optimizing for different things.
What’s next?
If evaluation is the future of AI, how should we expect the landscape to change? First, we need to accept that there will be tradeoffs in how we spend our time and resources. For example, as Dr. Narayan argued recently, non-AI companies should probably think more about evaluation than about training their own models.
Training requires well-labeled, balanced datasets to achieve anything even remotely useful, and most organizations don’t have the horses. Putting our efforts into evaluation means that we will probably see some pull-back from other AI strategies, and that is a smart move.
Second, we should prepare ourselves for a breakaway effect between the old benchmarks and user perceptions of AI quality. I predict that in 2026, one model will emerge that will be popularly acknowledged to “just work better” for most automated tasks even as it falls behind in some of the classic performance benchmarks. If your AI can solve my problem reliably, I don’t care whether it places 15th on Humanity’s Last Exam.
Third, we can anticipate a decrease in overfitting to evaluation metrics as our benchmarks become more focused on “utility,” to use Yao’s term. The more realistic an evaluation metric is, the less problematic overfitting will be. Asking “How many images of stop signs were correctly classified?” is an old benchmark, but “How many Waymos ran stop signs last month?” is an example of a new one. And, as my example suggests, we should seriously consider focusing more resources on post-market evaluation (including small trial groups), since these are the most realistic cases.
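To illustrate the difference in kind, here is a minimal sketch that computes both metrics side by side, with invented figures: a benchmark accuracy (an output) and a deployed violation rate normalized by exposure (an outcome).

```python
# A minimal sketch contrasting an output-style metric with an outcome-style
# metric for the stop-sign example. The figures are invented for illustration.

# Output metric: accuracy on a labeled benchmark of stop-sign images.
benchmark_correct = 9_980
benchmark_total = 10_000
classification_accuracy = benchmark_correct / benchmark_total

# Outcome metric: how often deployed vehicles actually ran a stop sign,
# normalized by exposure (here, per 100,000 miles driven last month).
stop_signs_run = 4
miles_driven = 2_300_000
violations_per_100k_miles = stop_signs_run / miles_driven * 100_000

print(f"benchmark accuracy: {classification_accuracy:.2%}")
print(f"violations per 100k miles: {violations_per_100k_miles:.2f}")
```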
Fourth, expect a change in the composition of evaluation teams to bring in people with more diverse skillsets. The old evaluation benchmarks were classic data science tasks: label images, score performance as a percentage of correct responses on known-answer multiple-choice questions, find the most efficient algorithm to solve a problem with reinforcement learning. The new shift towards evaluation will involve lots of real-world testing, gathering and interpreting new kinds of unstructured user data, surveys, focus groups, interviews, and participant observation. These are the kinds of evaluations you can’t just conduct at your computer.
Fifth, and finally, I think we will discover some reverse salients4 in evaluation as a discipline as it gets drawn into AI evaluation specifically. A reverse salient is a part of a sociotechnical system that is less developed than other parts, creating a gap in the advancement of the technology. Ideally, learning about the reverse salient causes innovators to swarm the bottleneck and innovate. Right now, the larger evaluation field is not fully ready to parent our fledgling subfield. In particular, our skill gaps and bad ideas will become society’s problem if they are applied to AI evaluation. On the other hand, formally-trained evaluators have much to contribute if we can rise to the occasion.
To clarify the technical mechanism: reinforcement learning agents trained for specific domains (like Dota 2 or robotic manipulation) failed to transfer their learned behaviors to new domains. The breakthrough came from pre-training large language models on massive text corpora using transformer architectures. These pre-trained models developed rich internal representations of world knowledge, common sense reasoning, and language structure – the “priors.” When RL techniques (like RLHF, Reinforcement Learning from Human Feedback) are applied to these pre-trained models, they can be fine-tuned for diverse tasks (chatbots, coding assistants, web navigation) with far less task-specific training than would be needed starting from scratch. The point here is that the same RL algorithms that failed to generalize when applied to randomly-initialized agents now succeed when applied to agents that already possess language priors! Modern systems like ChatGPT and the o-series models use this recipe: massive pre-training for priors, followed by RL for task-specific optimization.
As Nietzsche or Comte would say, we are still in a metaphysical attitude regarding standards. They appear “otherworldly” – which seems both a cause and an effect of the little effort that has gone into setting them. If you’re worried about floating away, try a simulation.
Hughes, T. P. (1993). Networks of Power: Electrification in Western Society, 1880–1930. JHU Press.

