My serious mentors have all overlapped, without colluding with each other as far as I know, in teaching me a very important lesson: when asked a question, I need to answer it directly and completely. Early on in my doctoral studies at UC Santa Barbara, after I had learned to stop talking when I ran out of evidence, I would often fail to answer the question I was asked. When this happened, my advisor would calmly order: “Speculate.” This was how I learned that my ignorance did not get me off the hook. Even if my answer contained only my own educated opinion, this was still better than a dodge as long as I appropriately framed my conjectures as such.
My general view on research methodology is based on this principle. The most important thing we can do for stakeholders is to answer the questions they asked, even if we have to tell them that our answers have considerable uncertainty. In statistics, this idea was immortalized by John Tukey, who said “Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.”
In evaluation, we use evaluation questions to focus our inquiries. Deciding on evaluation questions is arguably the first and most important step of an evaluation, since it sets the agenda (and thereby the budget and timeline) for the rest of the process. Questions are upstream of methodology, and methodology exists to answer questions. If methodology is applied epistemology, then answering specific factual questions is what makes it applied. The wording of evaluation questions is frightfully important, since it can make the difference between an evaluation that is feasible and one that is impossible.1
My argument in this post is that far too many well-resourced evaluations fail to ask the right evaluation questions or fail to answer the ones they do ask.
The final evaluation report for the Moving to Opportunity demonstration program turns out to be a great example of this. As I’ve discussed previously, the MTO program randomly assigned housing vouchers to move families in poverty from poor neighborhoods to wealthier ones, then tracked participants and a control group over a couple of decades. This was a good evaluation in many ways, which is why I’ve chosen it as an example to dissect here. If I’m arguing against current evaluation practices, I don’t want to waste your time pointing out obvious problems – I want to show that even some of the best work in evaluation is seriously flawed right now.
What are the evaluation questions?
The final evaluation report for MTO focuses on three “key” questions (pp. 23-24):
What are the long-term effects of a housing mobility program intervention on participating families and their children, and how did these effects evolve over time?
What are MTO’s long-term effects on those children who had not yet entered school when the study began?
What are the mechanisms through which MTO affects long-term outcomes?
There is already a lot to notice in these questions. First, you may recall from my previous posts on MTO that the intention of the 1992 US Congress in funding the MTO study was to document “the long-term housing, employment, and educational achievements of the families assisted under the demonstration program.” If these marching orders had been translated into evaluation questions, they would have looked something more like this: “What are the effects of MTO on the housing, employment, and educational outcomes of participants?” By the time of the final MTO study, the question had become about “effects” in general, which turned out to include health outcomes and others. The second evaluation question has a similarly broad scope, looking for all the “effects” but restricted to young children. Unfortunately, the second evaluation question is fully entailed by the first, since the first question explicitly asks about effects on children. The third question about mechanisms is certainly a worthy one, particularly from a realist evaluation perspective.
What would it take to answer these questions?
The first two evaluation questions ask us to find the “effects” of the program. Without any kind of limitation on what those effects might be, we have to conclude that the entire set of potential effects is included. For some types of interventions, we could argue that the intervention itself limits the sort of effects we might see (e.g. very minor educational interventions), but not for an intervention that makes such a dramatic change in the lives of participants. Moving across town could change almost every facet of one’s life, from relationships, to school, to jobs, to exposure to pollution. By broadening the initial mandate beyond housing, employment, and education, the MTO evaluators created a situation in which, to answer their evaluation question, they needed to measure as many outcomes as possible – and even this would never be enough. The idea that it is possible to simply observe all the effects of an intervention ignores the role of both instrumentation and theory in knowledge construction. "Effects" aren't just waiting around for us to observe them; we need to hypothesize what to look for and use specific methods for creating observations. This is core to postpositivist, scientific realist epistemology. The alternative is naïve realism.
To answer the third evaluation question, we would need a way of identifying potential mechanisms and triggering them. As Pawson and Tilley (1997) say, in order to learn about mechanisms, we need to be "in a position to manipulate an experiment to create conditions for previously identified potential mechanisms to be triggered by the measures introduced" (p.87). The MTO study, due to its relatively simple experimental design, is not really in such a position. We have one control group and two treatment groups, with treated participants given either a general Section 8 voucher or a special Section 8 voucher to move to a low-poverty area. This manipulation allows us to isolate one mechanism, namely whether moving at all triggers the positive outcomes or whether participants need to move to a low-poverty area specifically to get these effects. Unfortunately, there was another variable that differed between the two experimental groups: the participants who got the special voucher to a low-poverty area also got "mobility counseling," which the evaluators valued at about $4,500 per family. This is too large a benefit to ignore, so we can't actually isolate the effect of moving to a low-poverty area versus moving in general.
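To make the identification problem concrete, here is a toy sketch with made-up numbers (none of them come from the MTO data): because voucher type and counseling always travel together, any observed gap between the two treatment arms is the sum of two effects that this design cannot pull apart.

```python
# Toy illustration with hypothetical numbers (not MTO estimates): the
# experimental arm differs from the regular Section 8 arm in two ways at
# once, so the observed gap between arms is the sum of two effects.
effect_low_poverty = 0.10   # hypothetical effect of moving to a low-poverty area
effect_counseling = 0.05    # hypothetical effect of mobility counseling

observed_gap = effect_low_poverty + effect_counseling
print(observed_gap)
# 0.15 -- but any split of this gap (0.15 + 0.00, 0.10 + 0.05, 0.00 + 0.15, ...)
# fits the observed data equally well; only a design that varies counseling
# separately from voucher type could identify the two effects.
```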
How did they try to answer these questions?
The final MTO evaluation took a crack at answering questions one and two using an exploratory frequentist design. They chose a large number of constructs to measure and tested for differences between control and treatment groups using standard null-hypothesis significance testing. The limitations section is decidedly sunny-side up: "Although we have sought to maximize the scientific quality of our long-term MTO follow-up, it remains possible that we have missed some important MTO impacts" (p.259). Read: maybe the program was even better than we thought! However, the limitations section doesn't contain any cautions about false positives, that is, the probability of getting a significant p-value even when there is no true effect. This omission is a big problem for the MTO study, since there are a lot of significance tests at the α = 0.05 threshold. I stopped counting at 800 tests in the final report because I have better things to do. The classic frequentist solution to this issue is to correct p-values for multiple comparisons, which the study does not do. (The results section tempts fate further by flagging results at p < 0.1 as marginally significant.) In a frequentist framework, this means there is an incredibly low chance that the study's significant results are all correct.2 In fact, after just the first 25 tests, the study was more likely than not to contain at least one false positive, and after the first 100 tests it is nearly certain to contain one. Needless to say, from a frequentist perspective we shouldn’t trust the results.
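For a back-of-the-envelope sense of the scale of the problem, here is a quick sketch (assuming, for simplicity, independent tests, which the MTO tests are not, and that every null hypothesis is true):

```python
# Familywise error rate: the chance of at least one false positive among m
# independent tests at alpha = 0.05 when every null hypothesis is true.
alpha = 0.05
for m in (1, 25, 100, 800):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:4d} tests -> P(at least one false positive) = {fwer:.3f}")

# Approximate output:
#    1 tests -> 0.050
#   25 tests -> 0.723
#  100 tests -> 0.994
#  800 tests -> 1.000
```

A Bonferroni-style correction would simply test each comparison at 0.05/m instead of 0.05, which is the sort of adjustment the report never makes.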
As for question three, concerning mechanisms, the MTO study adopted the strategy of testing hundreds of hypotheses and using the results to rule out mechanisms that weren’t at work. Here’s how they explain this strategy:
…because MTO changes multiple neighborhood attributes simultaneously, isolating the effects of specific mechanisms is complicated. However, the study design nonetheless can help us at least rule out some possible mediating mechanisms. If, for example, MTO has long-term beneficial effects on the mental health of female adults, yet has no detectable effect on access to health care services including mental health care, the pattern of findings would provide some evidence against the importance of that mechanism. (p.24)
This seems like it could work until you ask whether the study had the statistical power to detect these effects. The authors acknowledge that the study was underpowered: “Perhaps the most serious scientific challenge with this study concerns its statistical power…” (p.259). This is the problem of false negatives (Type II error) as opposed to false positives (Type I error), which we confronted while thinking about multiple comparisons above. If our whole strategy for understanding mechanisms relies on finding significant effects that we don’t have the power to detect, then we can’t actually answer evaluation question number three either. Moreover, if you are familiar with mediation analysis, you know that the mediator can still be a pathway for the treatment to have its effect even if there is no significant difference on the mediating variable between treatment and control groups. This is the case even when the study is adequately powered, and has to do with factors like the size of the indirect effect (mediator) relative to the total effect and the complexity of the mediation model.3
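To see how quickly this bites, here is a rough power sketch under purely hypothetical assumptions (two equal groups of 1,500, a standardized outcome, a two-sided test at α = 0.05); none of these numbers come from the MTO design, but the pattern is the point: the small effects we would expect individual mediators to have are missed more often than not.

```python
from scipy import stats

# Rough two-group power calculation using a normal approximation.
# All numbers are hypothetical, for illustration only.
def power(d, n_per_group, alpha=0.05):
    se = (2 / n_per_group) ** 0.5            # SE of a standardized mean difference
    z_crit = stats.norm.ppf(1 - alpha / 2)   # critical value, about 1.96
    return 1 - stats.norm.cdf(z_crit - d / se)

for d in (0.05, 0.10, 0.20):
    print(f"true effect d = {d:.2f}: power ≈ {power(d, n_per_group=1500):.2f}")

# true effect d = 0.05: power ≈ 0.28  (a small mediator effect is usually missed)
# true effect d = 0.10: power ≈ 0.78
# true effect d = 0.20: power ≈ 1.00
```

With power like that on small mediator effects, “no detectable effect on the mediator” tells us very little about whether the mechanism is actually at work.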
What could we have done instead?
We have already seen that a problem with evaluation questions one and two is that they require us to look for an unbounded set of effects. This is an easy problem to fix: we should specify in advance what effects we are looking for and focus our available resources on looking carefully at those effects. (This is related to the issue we explored previously about a lack of criteria.) This may mean that we do some initial poking around in the literature or use open-ended methods like ethnography before we settle on evaluation questions. But one of the evaluation questions should not amount to “What are we going to observe or measure?” That is an important question, but it isn’t an evaluation question.
The most answerable evaluation questions are those which ask us to update stakeholder beliefs. For example, "Did the program improve housing conditions, employment opportunities, or educational attainment?" implies that the stakeholders are open to a variety of possibilities: that the program was better, the same, or worse than doing nothing. These beliefs can be formalized as a weak prior centered on zero and spreading out over a wide but finite range of positive and negative outcomes. Our evaluation will take this prior belief into account, add data, stir with a Bayesian spoon, and update stakeholders into a posterior belief. Evaluation questions like "What are MTO’s long-term effects on those children who had not yet entered school when the study began?" don't really imply a prior distribution of beliefs to build from, since they don't identify any effects about which to have beliefs. We might be tempted to argue that the prior distribution implied here is the principle of indifference, that is, that all potential "effects" are equally likely. However, this doesn't really make rational sense: it is not equally likely that the MTO program will result in increased graduation rates and that it will create a black hole. (Besides, we can’t distribute credibility evenly across an open set anyway.) The same issue applies to a general search for "mechanisms" – we just end up drawing up a list of plausible mechanisms and testing those, because this is the information required to form an answerable question.
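Here is a minimal sketch of what that updating looks like, using made-up numbers and a simple normal-normal model (nothing here is an actual MTO estimate):

```python
# Hypothetical numbers for illustration only (not MTO estimates).
# Weak prior centered on zero: stakeholders think the program could help
# or hurt, but are skeptical of very large effects in either direction.
prior_mean, prior_sd = 0.0, 0.5      # effect in standard-deviation units
obs_effect, obs_se = 0.15, 0.06      # a hypothetical estimate from the data

# Conjugate normal-normal update (known variance): a precision-weighted average.
prior_prec, data_prec = 1 / prior_sd**2, 1 / obs_se**2
post_var = 1 / (prior_prec + data_prec)
post_mean = post_var * (prior_prec * prior_mean + data_prec * obs_effect)
post_sd = post_var ** 0.5

print(f"posterior belief: {post_mean:.3f} ± {post_sd:.3f}")
# -> roughly 0.148 ± 0.060: the data dominate a weak prior, but the prior
#    would rein in a noisy, implausibly large estimate.
```

The particular numbers are beside the point; what matters is that "did the program improve X?" hands us a parameter to have beliefs about, which is exactly what the open-ended "what are the effects?" question fails to do.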
A more defensible way to handle the mechanism question would be to attend closely to variation in the relationships between communities (since altering these is the main treatment), subsamples, and outcomes. This is the standard realist evaluation approach (looking for context-mechanism-outcome, CMO, differences). The final evaluation effectively treats all the communities in the treatment condition the same when they are not – a fact which we learn from the qualitative study of the MTO evaluation. The design of the MTO study would only allow us to observe these differences cross-sectionally, but we could theoretically have introduced more experimental manipulations to handle this as well, such as creating subgroups that send voucher recipients to areas with identical schools versus better schools, identical employment rates versus better employment rates, and so forth. These are just more treatment arms that could be collapsed into a single group for the main analysis if needed.
To handle the issue of multiple comparisons, switching from a frequentist to a Bayesian framework would be very helpful. Bayesians generally don't do corrections for multiple comparisons because we don't buy the standard frequentist metaphysics that we need to account for the number of tests the analyst intends to conduct. Instead, when we have a belief that the data generation process is noisy, we adjust our priors to be more conservative and shrink parameters of interest closer to zero. No corrections are required and we don't embrace an incentive structure set up to punish inquisitive analysts, as Kruschke argues.4 This doesn’t mean we should accept the results of the MTO evaluation at face value, however, since these results don’t have either frequentist corrections or Bayesian skeptical priors. Without these improvements, they’re probably just wrong.
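As a concrete (and entirely made-up) contrast, here is a small simulation in the spirit of that argument: 800 noisy estimates, most of them truly null, screened the frequentist way at p < .05 versus shrunk toward zero by a skeptical normal prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 800 outcomes, 5% of which have a true effect of 0.10,
# each estimated with standard error 0.05. None of this is MTO data.
true_effects = np.where(rng.random(800) < 0.05, 0.10, 0.0)
estimates = true_effects + rng.normal(0.0, 0.05, size=800)

# Frequentist screen: flag anything with |estimate| > 1.96 * SE (p < .05).
flagged = np.abs(estimates) > 1.96 * 0.05
false_positives = flagged & (true_effects == 0.0)

# Bayesian alternative: a skeptical Normal(0, 0.03) prior shrinks every
# estimate toward zero rather than thresholding it.
prior_var, data_var = 0.03**2, 0.05**2
shrinkage = prior_var / (prior_var + data_var)   # about 0.26
posterior_means = shrinkage * estimates

print("flagged as significant:", flagged.sum())
print("of which truly null:   ", false_positives.sum())
print("largest posterior mean:", round(float(posterior_means.max()), 3))
```

The skeptical prior doesn't pretend the noise away; it simply refuses to let any single noisy estimate masquerade as a large, headline-worthy effect.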
The way we answer questions should be direct, not convoluted. In my view, this knocks out frequentist “double negative” answers:
Stakeholder: Did the treatment have an effect?
Frequentist: Well, the data weren’t consistent with the hypothesis of no effect.
Stakeholder: Huh?
Bayesian: Excuse my colleague. Yes, we found strong evidence of an effect.
To form the evaluation question using statistical language, we could ask: “Given the available information, what is the probability that the treatment had an effect?” Frequentists can’t actually answer this question symmetrically because their answers don’t come in the form of probabilities about the hypothesis, but in the form of likelihoods (i.e. the probability of seeing data like ours conditional on the null hypothesis being true). Bayesians can answer the question directly, e.g. “99% probable.”
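With the hypothetical posterior from the earlier sketch (mean 0.148, standard deviation 0.060), that direct answer is a one-liner:

```python
from scipy import stats

# Hypothetical posterior carried over from the earlier sketch.
post_mean, post_sd = 0.148, 0.060
p_positive = 1 - stats.norm.cdf(0, loc=post_mean, scale=post_sd)
print(f"P(treatment effect > 0 | data) ≈ {p_positive:.2f}")   # about 0.99
```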

To summarize, we are looking for methodological symmetry between evaluation questions and the answers we provide. Only answerable questions are allowed. We shouldn’t swap in an easier question for the one we were asked when providing our answer. We need to answer the question we are asked, and our answer shouldn’t be convoluted. These simple norms guide us towards certain methodologies, like Bayesian inference. They also orient us towards a more philosophical focus on the logical entailments of evaluation questions, which should be well-formed. A consequence of this is that evaluators need to be part of the process of forming these questions, not downstream of it, since evaluators can help stakeholders engage in dialogue to get to well-formed questions. Another consequence is that many projects will need to engage in a period of general research before the evaluation takes place in order to gather the information needed to ask an answerable question.
Since I’ve talked about goal-free evaluation in previous newsletters, I want to address the question: does GFE have evaluation questions? The answer is that it does; it just doesn’t use stakeholders’ goals to formulate them. Like evaluation criteria, they are formulated by the evaluator.
Astute readers may have noted that the MTO evaluation says "Standard errors were adjusted for family clustering" (p.13). This refers to correcting results for the fact that literal family members were in the study; it does not refer to "familywise error," which is an issue of multiple comparisons.
Oddly, the evaluation doesn’t even fully commit to the strategy of using significant differences between treatment and control groups to identify mediators. “We find little effect on most health mediators, except for safety and stress, which makes us think that safety and stress could potentially be the key mechanisms for the effects on obesity and diabetes that we observe for adults. At the interim report, we saw various effects on exercise and eating of fruits and vegetables of moving to lower-poverty neighborhoods, whereas at the final evaluation we only see an impact on vigorous exercise for the experimental group.” I did a double-take when reading this paragraph: even though the treatment group exercised significantly more than the control group, the evaluation claims that changes in safety and stress are the mechanisms for declining obesity in the treatment group, but doesn’t call exercise a potential mediator. In the treatment-on-the-treated group, participants were significantly more likely to exercise, at a rate about 9% higher. The same group also had a significantly lower rate of BMIs over 35, again by about 9%. This is not a serious analysis, but it cries out for one.
See Kruschke's chapter 11 in Doing Bayesian Data Analysis for an illustration of how analyst intentions cause p-values to jump around.