New Directions?
Zhao, Bryant, & Erford (2025)
One of the flagship journals of American evaluation, New Directions for Evaluation (NDE), recently published a meta-analysis of the contents of its own articles. In this paper, authors Ziyu Zhao, Imani Bryant, and Bradley T. Erford statistically summarized the content of 457 articles published in the journal from 2010 to 2024. While this meta-analysis covers only one journal, NDE is indeed a very important one. To the extent we can treat it as representative of the field, this meta-analysis is a very interesting mirror. To the extent that we cannot, I’m going to argue that we should probably be interested in the results anyway, given the direction its biases point.
The Good News
The most recent period of articles in the study showed that 74% of authors published in NDE are now women. This may seem like a gender imbalance until we realize that 74% of American evaluators are women according to American Evaluation Association data. That is, NDE has actually achieved the correct gender ratio for women authors - right on the nose.
Authors are also collaborating more than they used to. In the 2010-2014 period, the average paper had fewer than two authors. In the most recent period, the average paper had more than three authors. I think that collaboration is good for scholarly inquiry because collaborators are more likely to check each other’s work, rein in fraud, and ask important questions before the paper lands on the editor’s desk. This doesn’t mean we should disincentivize single-author papers, but it’s probably a good trend for the field to see more collaboration.
The Bad News
Surprisingly, of all the articles published in New Directions, only 10% could be considered “research articles” with a sample, methods, and results sections. About half of these studies (51%) were purely qualitative, with the remainder divided between mixed methods (27%) and purely quantitative studies (22%). Within these research studies, typical sample sizes were small, with a median sample size in the most recent period of a meager 51 participants. Only 7% of the research studies used any kind of random assignment, and this remained stable over time. Fewer than a quarter of the research studies identified the sex or race of participants (much less breaking down results by demographics). The research quality of the studies was low overall, with no qualitative study reporting dependability or generalizability of results, no quantitative studies reporting reliability, and only two quantitative studies reporting effect sizes.
One of the hardest paragraphs of this meta-analysis to read was the breakdown of which statistical methods were used in the research studies: 87% of the studies used only “basic” methods (descriptives, correlations), 9% used “intermediate” methods (ANOVA, regression), and only one study used anything that the researchers called “advanced” (and, God help us, it was just factor analysis). This categorical distribution of basic-intermediate-advanced methods hasn’t changed in fifteen years.
The author characteristics also tell a sad story. During the last fifteen years, the proportion of international (non-US) authors published in NDE remained flat, except for a brief spike in 2015-2019. What’s keeping NDE from expanding internationally? The authors of the paper note that University of Minnesota and UCLA are the top two institutional contributors to the journal. They conclude:
“The results suggest a need for a broader representation of non-university institutions and a global perspective to ensure a more diverse and culturally enriched focus.”
In other words, it certainly appears that they publish a lot of Americans because they publish a lot of people from the University of Minnesota and UCLA. When we understand that the vast majority of evaluators are not university-affiliated, the fact that 55% of NDE authors are university-affiliated is a giant red flag - it also hasn’t changed since 2010. These authors are working in higher education contexts and… guess what? The proportion of NDE articles about higher education has more than doubled in the last fifteen years, while the proportion of articles that included K-12 participants has dropped to zero. That’s right - not a single paper including data about a K-12 student was published in NDE in 2020-2024, out of more than 200 papers.

Recap
We’ve learned that a flagship evaluation journal is 90% non-research and that the research it does publish consists of small sample sizes and easy Stats 101 designs with barely any randomization. The qualitative research it publishes doesn’t include any information about dependability or generalizability. Most of the authors who publish in NDE are university-affiliated Americans, while most evaluators are (obviously) non-affiliated non-Americans. This likely creates a huge bias towards writing about higher education (which about half of people attend) versus K-12 (which virtually everyone attends): 17% of articles vs 0.5% of articles. This is not an outsider critique - it was published in the journal itself.
Old Directions
Perhaps the most surprising part of Zhao and colleagues’ meta-analysis to me is how stable most of these trends have been over the last fifteen years. The demographics of publishing caught up to the true gender distribution of the field, but the elite universities still dominate. Despite the pedigrees of these authors, our methods of doing research have not become any more sophisticated or trustworthy in the last fifteen years. Evaluation research has not attained the minimum methodological standards of any adjacent social science field, such as sociology or psychology.
But are these just particularities of our NDE sample, rather than broader trends? Let’s think about what we would expect from the NDE sample and decide whether this invalidates our concerns or heightens them. First, studies that make it through the publication pipeline into NDE are likely to have higher than average research quality versus those that are submitted but not published. This means that the status quo for the field is probably considerably worse. Second, the authors who are included in this sample have considerable institutional resources not available to most practicing evaluators - for example, they could walk over to a colleague in the statistics or anthropology department and ask a question about methodology, or consult any journal to which the university has access. This means that we should expect the NDE authors to be better informed than the average evaluator as well, which again makes our conclusions more worrying. Finally, the “desk drawer” publication bias that occurs before evaluators even decide to submit a study to an evaluation journal means that these were probably the best studies that each individual author had available to submit from among several alternatives, which likely had even worse problems, so we are seeing a positive bias within authors as well. I can’t think of any biases that point towards the inference that the quality of research in NDE would be lower than the average quality of the field. Add to this our earlier realization that only about half of evaluations appear to contain any evaluative judgments at all.
I think we have a problem. I don’t think evaluation has been going in any particular direction. It hasn’t improved in rigor or in the representation of the views of authors outside of academia or outside the US. These are problems within our field that don’t even begin to touch on the problems of our professional organizations (mine is the American Evaluation Association - I hope yours is doing better if you have a different one) or those posed by a hostile political climate for evaluation.
One might argue that these findings reflect evaluation's legitimate priorities rather than its failures. Maybe evaluation emphasizes practical applicability over methodological sophistication on purpose. Stakeholders need actionable insights, the argument might run, not publishable effect sizes. The concentration of university-affiliated authors might simply reflect who has time to write for academic journals, while practitioners focus on the actual evaluation work. And maybe the dominance of basic statistical methods indicates that evaluation has found its appropriate methodological level on some hypothetical rigor-to-practicality continuum. These are all versions of arguments that I’ve heard in real life. However, they don't explain why qualitative studies report neither dependability nor generalizability, why we can't manage both practical relevance and methodological transparency, or why the field has remained methodologically frozen for fifteen years while adjacent disciplines have advanced. I have never believed that evaluation should abandon its practical orientation; the problem is that we've apparently decided that “practical” means “methodologically unaccountable.” A field should serve practitioners well while still meeting basic standards of evidence. The fact that we're not even trying suggests something deeper than a principled commitment to accessibility. Rigor is practical. Transparency is more accessible. Quality is key to stakeholder trust.
New Directions
But these are, in a way, the old directions that evaluation has been going. If we choose, we can make these troubling trends irrelevant. I have personally spent much of 2025 giving my time to non-academic efforts to improve evaluation - I’ve tried to publish a couple of things in journals, but mainly I’ve written for this mostly-free newsletter, developed software for evaluators (which I’ve made complimentary for subscribers of this newsletter), hosted talks and trainings for other evaluators around the US, mentored early-career evaluators, and worked within my AEA Topical Interest Group to try to build a better professional network for evaluators with great technical skills who are dedicated to improving our craft. I took on these activities, rather than trying to bang out a hundred publications, because 1) I’m not at a university anymore and nobody can make me publish if I don’t feel like it, and 2) I judge that these activities are much more important for the future of our vocation. What if these were the “new directions” for evaluation?
To close 2025 and start 2026, I’ll leave you with a round-up of some of the concepts from this newsletter that I think are good enough to bother mentioning to you again. Maybe one of them will send you off in a new direction.
Major concepts from this Substack from 2025:
Evaluators should spend just as much effort defining benchmarks as they do measuring performance, and always compare performance to those benchmarks.
Evaluation is an implicit part of all decision-making because it yields preference orderings. We can fold evaluation into decision theory by acknowledging the dialectical relationship between evaluations and decisions.
Qualitative methods are mostly misused in evaluation because they usually ignore its general logic, engage in meaningless quantification, reify participant perceptions of reality, and fail to address the actual evaluation questions. However, there are some very impactful ways to use qualitative methods in evaluation, including defining the evaluand, setting benchmarks, constructing measures, and eliciting priors.
The evaluation of artificial intelligence systems belongs within the larger transdiscipline of evaluation and straddles program, product, and personnel evaluation, as well as the “evals” currently used. Evaluation is the future of AI because AI will not be able to improve much further without it.
Evaluators should write and propose budgets rather than making non-experts do it. This is feasible and much more realistic than the current practice.
The pressure to conduct pseudoevaluation can be understood in game theoretic terms as a coordination game in which evaluators must cooperate to win.
The fundamental activity of evaluation is measuring value and we can make several confident philosophical statements about what value means in evaluation.
Thank you to all the readers of this newsletter in 2025.


Thanks Anthony, I enjoyed your review (and the meta-study) because they unintentionally beg for a Sorting Hat treatment. The house balance the paper reveals seems pretty clear: NDE over 2010–24 looks strongly Gryffindor, with a small but vocal Ravenclaw minority, Slytherin pulling strings in the background, and Hufflepuff largely left out in the cold.
(The Sorting Hat metaphor is just for fun and not meant to be taken too seriously - still, it does throw light on an interesting pattern here. More in this post if curious: https://juliankingnz.substack.com/p/evaluator-sorting-hat).
Gryffindor feels well catered for: the dominance of human services and international development domains, and the journal’s longstanding attention to practice and profession, line up with a strong normative commitment to "social betterment".
The Ravenclaws are there, but they’re a minority voice. There's a stable slice of research on evaluation, but it’s only ~10% of articles, mostly using basic designs and descriptive statistics with very little reporting of effect sizes, reliability, or validity, so the hardcore methods crowd gets a little space but probably remains a little frustrated.
Slytherin shows up in the "Old Directions" institutional ecology: agenda‑setting power is concentrated in a relatively small cluster of US universities, and the rise and fall of international authorship suggests influence is not evenly distributed, even if everyone is speaking the language of “the field.” Ambitious evaluators still know exactly which common rooms matter.
Hufflepuff, in my typology, is about who gets included in evaluation (e.g., participation, co‑creation, stakeholders’ voices). On that front, when I look at the meta‑study I hear crickets. It would seem that either NDE, or perhaps the meta-study authors, didn't see stakeholder inclusion, power sharing, collaboration, and developing value as important. Or more charitably, perhaps they just didn't look for it. Absence of evidence isn't evidence of absence, but it does tell us something about what was considered salient.
And that brings us to evaluative reasoning - the core (imho) of what it means to evaluate. Every house does it, but they favour different approaches. What’s striking to me is that evaluative reasoning – valuing, criteria, standards, synthesis, warranting value judgements – is missing from this NDE meta-study, even while we slice the field finely by topic, method, and author demographics. I know the evaluative reasoning theme does exist in the journal because the 2012 NDE on valuing (edited by George Julnes) and the 2018 edition on evaluative thinking (Vo & Archibald), are two of my favourites, but the analytic frame didn't pick them up.
Maybe that's the real invitation for the meta-study: if we're serious about "new directions", the next round of analysis might not ask who publishes what and where, but how well our flagship journals are actually supporting evaluators to reason their way to defensible value judgements... across all four houses :-)