In my last post, I discussed the Moving to Opportunity evaluation, one of the most comprehensive program evaluations ever undertaken. The MTO program randomly assigned participants living in poverty to a control group or to an experimental group that received either a housing voucher to move to a low-poverty neighborhood or Section 8 housing. It was a decades-long evaluation with implications for understanding the “poverty deconcentration” hypothesis: that the most negative effects of poverty are emergent, arising from heavy concentrations of poverty such as those created by housing projects.
The MTO evaluation began with a mandate to report back to the US Congress on “the long-term housing, employment, and educational achievements of the families assisted under the demonstration program.” Later, other outcome objectives were added, and the evaluation ended up testing dozens of potential impacts. There is nothing inherently wrong with testing many different potential impacts. There is also nothing inherently wrong with deciding to measure different outcomes than you started with, and I’ve suggested that it’s partially consistent with goal-free evaluation. However, it is an interesting question whether these tested outcomes actually count as evaluative criteria.
As Mathea Roorda has argued, “Systematic development of criteria is fundamental to reaching a defensible evaluative conclusion, yet this is an aspect of evaluation that has been neglected by many evaluators.” She reminds us that the general logic of evaluation goes like this:
Establish criteria
Set performance standards
Measure
Reach evaluative judgments/conclusions
A good metaphor for the core evaluation process comes from track and field. In the high jump event, judges set the height of the bar at the beginning of a round, and athletes try to jump over it without knocking it off the stand that holds it. Only athletes who succeed advance to the next round. Here is an action shot of Cornelius Johnson, who won gold in the high jump at the 1936 Olympics (yes, those 1936 Olympics).1

Without criteria (vertical jumping) and standards (a bar set at a specific height by a judge), we would not know whether any particular jump was good or bad. In their absence, it would be very hard to know that Cornelius Johnson was worthy of Olympic fame. Without criteria, some observers would undoubtedly have rejected his performance.
If a program meets our criteria for success, then the correct evaluative judgment is that the program is a success; if it doesn’t meet our criteria, then the correct evaluative judgment is that the program has failed.
The MTO evaluation did not set out criteria for success in advance. Once again, I don’t think this is inherently a bad thing, as long as we are sure to set out criteria for success later. I haven’t been able to find evidence that this was ever done by the evaluators at any time in the process. In the writing about MTO, the attainment of certain “achievements” is described in positive terms and the failure to attain other achievements is described negatively. However, there is no attempt to make a synthesized judgment about the merit or worth of the program. Indeed, the words “merit” and “worth” are not applied to the program at all in the final evaluation report.
As a result, the MTO evaluation should not count as a true evaluation at all, even though it is commonly called one. By my standard, many projects that are called evaluations aren’t true evaluations.
Why are the criteria missing?
Probably the most common reason that criteria are not used in projects purporting to be evaluations is that evaluators don’t walk clients through the exercise of determining them, either before or after the data are available. Partly this may be a result of time pressure; partly, criteria appear to be rarely emphasized in formal evaluation training. Often, people with general social science skills are thrust into the role of evaluator – but they keep doing social science instead.
In the case of the MTO evaluation, we can hardly miss the skepticism of the authors of the final evaluation report towards the idea that the findings will be useful. In the conclusion of the report, they actually said:
…it is hard to imagine an elected official in contemporary America ever pushing to implement anything on a large scale that was even remotely like the experimental treatment, which required families to move at least initially into a low-poverty census tract.
Read: no one is ever going to run this program again. Perhaps they didn’t think there was much point in declaring the program a success or a failure if the program had no future. Even if the program “works” – so what?
The de facto criteria of merit might appear to be those referred to in the many tables of measured variables, since these are the outcomes referenced in the conclusion of the final report. However, it isn’t clear what the actual standards for success or failure would be for most of these outcomes. One of the findings in the final report was that rates of severe obesity appear to have been lower in the experimental group, although in general obesity did not decrease. That is, BMIs were stratified into 30+, 35+, and 40+ groups, and there were decreases in the proportion of participants falling into the latter two groups only.2 This is probably good news for the health of the participants, but does this level of change mean that the program has merit? The way the final report is written, it is impossible to say.
Some people might readily grant that the MTO evaluation is not actually an evaluation and argue that it was never supposed to be one. Indeed, the MTO project is variously described as a “social experiment”, a “study”, and a “research platform.” If this is the case, we don’t need criteria because we are just exploring. However, the various authors of the MTO documents also repeatedly refer to it as an evaluation and refer to each other as evaluators. The object of our inquiry here is a social intervention designed to produce beneficial effects, not a psychological phenomenon, a natural occurrence, or anything else that scientists might “study” in a generic sense – it’s clearly a program with participants. In fact, I would argue that the inconsistency in referring to the project by so many labels is a symptom of a basic confusion about what an evaluation really is.
What could they have done instead?
Of course, setting criteria and creating standards for those criteria is no easy task. Luckily there are some good fallbacks even when we aren’t sure what to do. Perhaps the readiest criterion for a social program is that it generates more benefits than costs – that is, it has a benefit-cost ratio greater than one. Value for money (VfM) approaches – and there are several choices available depending on the needs of stakeholders – are a fantastic default when it is difficult to set criteria. In the case of the MTO evaluation, we would need a full accounting of program costs (direct and indirect) and a design that would allow us to capture the benefits of the program. I am tantalized by this possibility in the MTO study, since we know that there were direct financial benefits that accrued to the children of the adults in the study. Did these benefits exceed the costs? How should we handle the discount rate when talking about potentially breaking the cycle of poverty in a way that could have benefits that accrue to the third generation and beyond?3
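To make that default concrete, here is a minimal sketch of how a discounted benefit-cost ratio could be computed. Everything in it – the cost and benefit figures, their timing, and the 3% discount rate – is a hypothetical placeholder for illustration, not a number from the MTO report.

```python
# Minimal sketch of a discounted benefit-cost ratio.
# All figures below are hypothetical placeholders, not MTO results.

def present_value(amount, discount_rate, years_from_now):
    """Discount a future dollar amount back to today's dollars."""
    return amount / (1 + discount_rate) ** years_from_now

DISCOUNT_RATE = 0.03  # assumed 3% annual rate; this choice matters a lot

# Hypothetical per-family program costs: (years from now, dollars)
costs = [(0, 10_000), (1, 2_000)]

# Hypothetical per-family benefits, e.g. children's extra earnings decades later
benefits = [(20, 5_000), (25, 15_000), (30, 20_000)]

pv_costs = sum(present_value(c, DISCOUNT_RATE, t) for t, c in costs)
pv_benefits = sum(present_value(b, DISCOUNT_RATE, t) for t, b in benefits)

ratio = pv_benefits / pv_costs
print(f"Benefit-cost ratio: {ratio:.2f}")  # > 1 would meet the default criterion
```

The point of the sketch is simply that the discount rate does a lot of the work: benefits that arrive a generation or more in the future shrink dramatically at higher rates, which is exactly the question raised above.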
If we don’t want to literally cash out the benefits of the MTO project in monetary terms, but we still want to decide the merit and worth of the program, then we will need to stipulate criteria instead. Criteria of merit can be selected by the evaluator, stakeholders, or both.4 Criteria do not have to cover every potential dimension of value for a program, but they should cover the main dimensions. Suppose that we decide that the program will be a success if, given an acceptable expenditure of resources, it causes the crime rate among participants to drop by 28%, since this is the size of the effect for another program5 that (let’s say) costs the same amount of money.
We could also combine several criteria of merit into a rubric. This rubric could include some failure conditions (absolute requirements) that take effect if certain very bad things happen during the program and override any positives. For example, if we find that the program increases violent crime, then we might not care if it also has a net psychological benefit for participants. The rubric would include standards for different levels of performance and weights to allow categories to be combined unless failure conditions are met. A rubric involving only absolute requirements can be presented simply as a checklist. For example, the MTO evaluators could have said that the program was a success if and only if: 1) it caused the crime rate among participants to decrease, 2) it caused participants’ incomes to rise, and 3) it did not cause any major harms in the process. This checklist would be more of an evaluation than the 330-page final report, due to the simple fact that it would use evaluative criteria.
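As an illustration only – the criteria, weights, scores, and threshold below are invented for the example, not drawn from the MTO report – a rubric that combines weighted criteria but is overridden by failure conditions could be sketched like this:

```python
# Sketch of a rubric: weighted criteria combined into an overall judgment,
# unless a failure condition (absolute requirement) is triggered.
# Criteria, weights, and scores are invented for illustration only.

criteria = {
    # name: (weight, score on a 0-4 scale for this program)
    "crime rate reduced":       (0.4, 3),
    "participant incomes rose": (0.4, 1),
    "housing quality improved": (0.2, 2),
}

failure_conditions = {
    # name: True if the very bad thing happened
    "program increased violent crime": False,
    "program caused major harms":      False,
}

def judge(criteria, failure_conditions, pass_threshold=2.0):
    # Any failure condition overrides the weighted synthesis entirely.
    if any(failure_conditions.values()):
        return "fail (absolute requirement violated)"
    weighted = sum(weight * score for weight, score in criteria.values())
    return "success" if weighted >= pass_threshold else "fail"

print(judge(criteria, failure_conditions))
```

A checklist is then just the special case where every item is an absolute requirement and there is nothing left to weight.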
In short, an evaluation without criteria turns out to be a social science research project, no matter how sophisticated we make it.
As a purely historical aside, although it is commonly believed that Jesse Owens was the American athlete snubbed by Hitler at the 1936 Olympic Games, it was actually Cornelius Johnson’s victory the previous day that prompted Hitler to stop congratulating athletes. Five minutes before the gold was awarded to Johnson, Hitler slunk away from his box, provoking the Olympic committee to reprimand him and precipitating the incident with Jesse Owens the following day.
I will have more to say about the methods of the report in a subsequent post.
“Discounting” is a mathematical correction in cost-benefit analysis that reduces the value of money we will receive in the future, to reflect the fact that money now is worth more than the same amount of money later. The discount rate is the correction factor, usually applied per year, that reduces the value of future money. For example, at a 3% annual discount rate, $100 received a year from now is treated as worth about $97 today.
Working through the implications of who should set criteria is something I hope to do in a future post.
Suppose that this alternate program is Becoming a Man, a Chicago-based program evaluated with an RCT that indeed reduced overall arrests by this amount.
Just to check my understanding, it sounds like your main critique is that “exploratory” evaluations are essentially glorified p-hacking.
Thanks also for linking to my article about different VfM methods. I would argue that CBA doesn’t let us off the hook from selecting criteria. It comes packaged with a single criterion (Kaldor-Hicks efficiency) so when we select CBA we are taking up a values position whether we declare it (or know it) or not. I argue, let’s define explicit, context-specific criteria and standards first, and then decide whether CBA has a place among our mix of methods.