Evaluation and legitimate judgment
Hurteau et al., 2009
For a long time, people have been wondering whether program evaluation is worth doing. While I contend that good evaluations are usually undervalued, stakeholders know that poor evaluations are overvalued. One meta-evaluative criterion by which to judge evaluations is whether they accomplish the central task that evaluations are supposed to do: make legitimate evaluative judgments.
One of my favorite evaluation studies, by Marthe Hurteau and colleagues, attempted to figure out the proportion of evaluations that actually meet this criterion using a rigorous procedure. The result was fairly shocking and is one of the findings that put me onto my current obsession of making sure that evaluative judgment is at the center of a symmetric process. But before we get into methodology, let’s get clear on a couple of points about judgment.
What is a judgment?
Judgment has historically been one of the major interests of philosophers in multiple traditions, from the peripatetics to Kant to the Scottish Enlightenment. For instance, Aquinas believed that judgment was a general faculty involving everything from classifying particular things as instances of abstract categories (which he observed nonhuman animals can do as well) all the way to rational judgments. The latter, Aquinas held, are an active and controlled use of the intellect in which reason imposes standards to guide itself.
In the discipline of evaluation, the classic conception of judgment comes to us from Scriven, whose four-step model of evaluation concludes with the judgment step: select criteria, set standards, gather performance data, integrate results into a final judgment. The judgment in this final step turns out to be what demarcates evaluation from other, similar activities like investigative research. According to Scriven, evaluative judgments are made by observing the differences between our standards and performance data.
What makes a judgment justified?
To understand what makes a judgment justified, it helps tremendously to understand what philosophers consider to be a valid logical argument. A valid argument is one in which all conclusions follow from the premises via the established rules of logical inference, which haven’t changed much in the last few millennia. A modern philosopher could have a pretty coherent conversation about argument structure with an ancient Greek as long as she didn’t resort to any of the handful of innovations from 19th- and 20th-century logical empiricism. A valid argument proceeds deductively like this: all good newsletters are a lot of work; this is a good newsletter; QED, this newsletter is a lot of work. A more complicated argument goes like this: either a newsletter is a lot of work or it is not a good newsletter, but not both; this newsletter is not a lot of work; therefore it is not a good newsletter.
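Classical validity is mechanically checkable: an argument is valid exactly when no assignment of truth values makes every premise true while the conclusion is false. Here is a toy sketch of that check applied to the two newsletter arguments; the `valid` helper and the propositional encodings are my own illustrations, not anything from the logic literature:

```python
from itertools import product

def valid(premises, conclusion):
    """An argument is valid iff no truth assignment makes all premises
    true while the conclusion is false."""
    for work, good in product([True, False], repeat=2):
        if all(p(work, good) for p in premises) and not conclusion(work, good):
            return False  # found a counterexample
    return True

# Modus ponens: all good newsletters are a lot of work; this is a good
# newsletter; therefore this newsletter is a lot of work.
mp = valid(
    premises=[lambda w, g: (not g) or w,  # good -> a lot of work
              lambda w, g: g],            # this is a good newsletter
    conclusion=lambda w, g: w,
)

# Exclusive disjunction: a lot of work or not a good newsletter, but not
# both; not a lot of work; therefore not a good newsletter.
xor_arg = valid(
    premises=[lambda w, g: w != (not g),  # exactly one of the two holds
              lambda w, g: not w],
    conclusion=lambda w, g: not g,
)
```

Both arguments check out as valid, while a classic fallacy like affirming the consequent (this newsletter is a lot of work, therefore it is good) fails the same test.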
Researchers in the social sciences didn’t really spend a lot of time worrying about making valid philosophical arguments until a strange turn of events: Stephen Toulmin wrote a mainstream classic on the topic of justification, seemingly saving a lot of social scientists the trouble of doing what I did in college, namely, filling blackboards with proofs and memorizing the difference between modus ponens and modus tollens. However, many in the social sciences who are familiar with Toulmin’s method mistakenly believe that they have learned something akin to what philosophers call a valid argument. Unfortunately (or fortunately depending on your perspective) they are missing out on some inside baseball from philosophy. Toulmin was something of a heterodox figure in British philosophy who believed that the old forms of logical argument were too rigid to describe the way that rhetorical arguments worked in contexts like courtrooms. Rather than conclude that many types of arguments made in courtrooms are simply invalid (as they demonstrably are), Toulmin created a new kind of “logic of justification”, wrote it down, and was promptly panned by logicians, who ridiculed him for being imprecise about the real argumentative steps he was trying to formalize.1 First they ignore you, then they laugh at you, then… you become the default argumentative framework for a completely different field? As I said, it was a strange turn of events.2
People with philosophical training immediately notice that Toulmin’s standard for justification is lower than the classical standard for logical validity. Justified positions merely need to be defensible. While this may seem like a very low bar, we need to remember that social scientists were not in the habit of justifying many of their key inferences in transparent ways - it is easy for social scientific writing to lapse into the pre-Baconian rhetorical forms emphasizing authority as well as more 20th-century preoccupations with theoretical elegance and ideological alignment. The arrival of Toulmin’s justificatory reasoning framework was a net positive for social science, even if it makes logicians’ eyes twitch.
In evaluation, Toulmin’s justificatory reasoning works like an exoskeleton surrounding the main evaluative claims, which are themselves a kind of argument. I believe that “evaluative arguments” go like this: if we stipulate these standards and the evaluand performs in these ways, then we render this judgment about the evaluand. (Other authors are less clear on this point, but I think this is the core of evaluative reasoning and I have yet to see any other form of evaluative judgment that makes any sense.) A Toulmin-style justificatory argument surrounds the basic evaluative argument with inferential arrows pointing towards the claims it is making: for example, if I make even the simple claim about performance that a program has a 71% completion rate, this can be justified with the warrants that my definition of completion is valid, that our data collection procedures were implemented to fidelity, and so on. Each of these claims can be justified with further backing, such as the idea that my definition of completion is the same as the one used by the gold-standard version of the program, that data collection fidelity was monitored, and so on. Qualifiers are limiting statements that acknowledge uncertainty, for example, by transparently stating that the definition of completion used by the program or the data collection procedures may not have been correctly used by all sites because of staff turnover. When we are thinking more dialectically about justification, we can also include rebuttals of the justificatory argument and additional branches of warrants to handle the rebuttals.
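To make the nesting of claims, warrants, backing, and qualifiers concrete, here is a toy sketch of the 71% completion-rate example as a data structure. The class and field names are my own invention for illustration, not terminology from Toulmin or the evaluation literature:

```python
from dataclasses import dataclass, field

@dataclass
class Justification:
    """One Toulmin-style justificatory layer around a claim."""
    warrant: str                  # why the evidence supports the claim
    backing: str = ""             # support for the warrant itself
    qualifiers: list = field(default_factory=list)  # stated limits on certainty
    rebuttals: list = field(default_factory=list)   # challenges, each answerable
                                                    # by further warrants

@dataclass
class EvaluativeClaim:
    claim: str
    justifications: list = field(default_factory=list)

completion_rate = EvaluativeClaim(
    claim="The program has a 71% completion rate",
    justifications=[
        Justification(
            warrant="Our definition of completion is valid",
            backing="It matches the definition used by the gold-standard "
                    "version of the program",
        ),
        Justification(
            warrant="Data collection procedures were implemented with fidelity",
            backing="Fidelity was monitored during collection",
            qualifiers=["Some sites may have applied the procedures "
                        "inconsistently because of staff turnover"],
        ),
    ],
)
```

The point of the structure is that every warrant can recursively carry its own backing, qualifiers, and rebuttals, which is what makes the justification an “exoskeleton” rather than a single inference.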
The study
Hurteau and colleagues conducted a literature search and randomly selected 40 evaluations for their analysis. They looked for two variables:
“the presence or not of a judgment (i.e. presence of a question or a goal, presence of the criteria, and presence of the standards); and
the elements required to produce a legitimate or justified judgment (i.e. justification of the criteria, justification of the standards, and documentation of the procedures used to synthesize the information into a judgment – a methodological consideration).”
Notice that the authors weren’t even making a judgment about whether the arguments were compelling - just whether there were any judgments at all and whether they had the ingredients to potentially make a judgment. These are low hurdles to clear in my book, particularly the first criterion.
What did they find? Only half the evaluations in their sample included any kind of judgment. In other words, only half of the evaluations were evaluations at all. Within the genuine evaluations, all included criteria but only 65% included standards. Most striking, “no document offered information on the procedures used to synthesize the information into a judgment.” That’s right - there were no justifications of how the synthesis process was carried out.

The implications of withholding judgment
If we allow ourselves to extrapolate these findings to the broader population of evaluations, what would we believe about evaluation? If you hire a professional evaluator, it’s a toss-up whether you will get an actual evaluation at all, and only about a third of the total (half rendering judgments, times the 65% of those stating standards) would even give you the standards behind their judgment, let alone justify them.
The authors ask,
“Could it be that practitioners do not establish a distinction between an inquiry that merely describes and a program evaluation that describes in order to generate a judgment or an evaluative conclusion?”
It appears that many evaluators are trapped in “descriptive” mode, perhaps adhering to the Value-free Doctrine of the classic social sciences which, as Scriven explained, does not transfer to evaluation. (Briefly, the Value-free Doctrine holds that researchers in the social sciences should refrain from imposing any particular set of values, particularly personal ones, when interpreting data - Boas is the exemplar of this approach, which was key to the success of anthropology.) However, evaluation is different from the classic social sciences in that its core output is judgments about value. Value-free evaluation is an oxymoron.
When we withhold judgment, we may be engaged in “monitoring” or “learning” but not evaluation. However, this doesn’t mean evaluation isn’t happening - it’s just happening somewhere else, usually less systematically and transparently. When we withhold judgment, how often are we setting the stage for suboptimal decisions that harm communities later? How often are we creating the very delays against which we warn organizations? How much public or foundation money is poured down the drain when the evaluators hired to make judgments refuse? When we have the right information available and the analytical resources to synthesize it, we are likely passing up the chance to make the most informed - and justifiable - judgment possible.
And when we don’t have enough information to make a judgment?
It is conceivable that the authors of the evaluations in which no judgments appeared were trying to do the prudent thing. Perhaps they found themselves in a situation in which there was not enough information to make a judgment. Maybe the data collection initiative that was planned was not able to go forward or the sample size was much smaller than expected. Perhaps rendering a definitive judgment would have been overconfident and thus, harmful.
I bring up this scenario because I think it reveals a common chain of events that ends up weakening evaluations: we don’t get the level of certainty we need to make judgments, so we fall back to descriptive analyses. At that point, we should stop calling whatever we are doing an “evaluation”, but we don’t. Allow this anti-pattern to happen often enough and suddenly we find that unconsummated evaluations are the norm. It is easier to twist the definition of evaluation than to admit that either 1) our evaluation was poorly planned, or 2) something got screwed up along the way and we can’t deliver what we planned.3
To resolve this, we need to make a hard distinction between two concepts: the directionality of our judgment and its level of certainty. As an evaluator with access to local knowledge and outside expertise, you will usually be able to render a directional judgment about the evaluand (remembering that judgments are always conditional on criteria and standards). Was the meal flavorful, was the film exciting, did the patients quit using drugs? If you can’t render at least a directional judgment, you didn’t spend nearly enough time with the evaluand or the stakeholders.
The level of certainty of judgments is another dimension of judgment entirely. While I may have enough information to say that a program appears to be working directionally, that judgment may come with wide uncertainty that includes a lower probability state of affairs in which the program isn’t working, e.g., I may be only 60/40 convinced that the program is working. Even though I am not 95% sure of my conclusion, this is still useful information in that it contains a directional judgment that can, at least, be updated with additional information later.4 A series of fairly inconclusive judgments, if they are recorded instead of thrown away, can be combined into a strong conclusion. By resisting the anti-pattern of retreating to descriptive studies, we can create the conditions to rise above uncertainty. Returning to Toulmin, this practice becomes more defensible when evaluators can articulate what evidence moved their beliefs and by roughly how much.
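To see how a series of weak directional judgments can accumulate into a strong conclusion, here is a toy calculation in log-odds form. The numbers and the `combine_judgments` helper are invented purely for illustration, and the sketch assumes the judgments rest on genuinely independent evidence:

```python
import math

def combine_judgments(prior_prob, likelihood_ratios):
    """Combine independent directional judgments in log-odds space.

    prior_prob: initial probability that the program works
    likelihood_ratios: strength of each judgment's evidence
        (>1 favors "working", <1 favors "not working")
    """
    log_odds = math.log(prior_prob / (1 - prior_prob))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)   # Bayesian update: add log likelihood ratios
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# Three recorded 60/40-style judgments, each equivalent to a likelihood
# ratio of 0.6 / 0.4 = 1.5, starting from a neutral 50/50 prior:
posterior = combine_judgments(0.5, [1.5, 1.5, 1.5])  # ≈ 0.77
```

Three fairly inconclusive 60/40 judgments move a neutral prior to roughly 77% - which is exactly why recording them, instead of throwing them away, matters.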
Communicating the distinction between direction and certainty in judgments is more difficult than retreating from judgment altogether. But when we consider that this retreat fundamentally transforms an evaluation into a research project, the added difficulty of educating stakeholders seems worth at least an attempt.
Synthesis
Hurteau and colleagues found that evaluations lacked transparent synthesis procedures for getting from data to judgment. I’ve now proposed that evaluators should make more judgments, even uncertain ones, based on limited information. This may seem to move in the wrong direction: more judgments with less evidentiary support. The resolution is that directional judgments with explicit uncertainty require more rigorous justification procedures, not fewer. If you’re going to tell stakeholders “I’m 60% confident this program works,” you need to be extraordinarily clear about: what evidence you have, what evidence you lack, how you weighted different sources, what assumptions you’re making, and what would change your assessment. It’s a great time for a well-diagrammed justificatory argument, since the bar for justification rises when certainty falls.
1. I found Toulmin’s own pitiful account of these rejections accidentally amusing: “Its first reception was entirely hostile. Peter Strawson dismissed it out of hand in The Listener; my colleague at Leeds, Peter Alexander called it ‘Toulmin’s anti-logic book’; while Richard Braithwaite was deeply distressed by it, seeing me as abandoning the standards he had set up in the philosophy of science.” Braithwaite was Toulmin’s doctoral advisor - ouch.
2. Toulmin has also been the target of criticism from educationalists, who have suggested that his focus on justifying existing beliefs is not very helpful for teaching students whether their beliefs are in fact justified. Toulmin’s method is nevertheless now ubiquitous in college writing curricula.
3. In some cases, the best evaluation output might be “the question as posed cannot be answered.” Note that this is itself a judgment, but about the evaluation design rather than the evaluand.
4. To be a good Bayesian, I must remind the reader that aggregation only works when: (1) the judgments are based on genuinely different evidence sources, (2) we can justify the inferential path from evidence to direction in each case, and (3) we’re updating rather than cherry-picking. I don’t have any evidence that other evaluation practitioners actually do this successfully, so my prescription here is entirely future-oriented.

