Carol Weiss and Program Theory
"Theory-Based Evaluation: Past, Present, and Future" by Carol Weiss (1997)
Imagine that you are evaluating a program that works, but not for the reasons that its designers believed that it would. It is being correctly carried out and it’s getting good results. Yet, when you look at the mechanisms that should be causing this change, they aren’t firing.1 You test some alternative mechanisms and find that these are actually more likely to be operative. How can we more precisely describe this situation?
Carol Weiss would draw a distinction between implementation theory and program theory.2 She would say that the program theory was the issue here, not the implementation theory.
To understand this, first, let’s get Weiss’ definition of implementation theory:
“Implementation theory focuses on how the program is carried out. The theoretical assumption that it tests is the assumption that, if the program is conducted as planned, with sufficient quality, intensity and fidelity to the plan, the desired results will be forthcoming.” (p. 46)
Most evaluations focus on implementation theory by default because they don’t gather enough details about the mechanisms supposedly at work in the program to determine whether these work or not. There might be a general statement about how the program is supposed to work, but this may be represented as a high-level logic model with inputs, activities, outputs, and outcomes. Logic models are useful points of reference to help everyone pin down the basics of what a program is supposed to do, but they usually aren’t specific about mechanisms, so we can’t call them theories.
Now, let’s compare this to Weiss’s definition of program theory:
“Programmatic theory, on the other hand, deals with the mechanisms that intervene between the delivery of program service and the occurrence of outcomes of interest. It focuses on participants’ response to program service.” (p. 46)
Theory-based evaluations are those that actually articulate and test programmatic theory. They are rarer because they require us to collect a lot more information about the processes that lead to the outcomes.
My preferred way of contrasting implementation theory and programmatic theory is in terms of black boxes: implementation theory can be tested while treating the program as a black box, but programmatic theory cannot. Programmatic theory needs to be able to articulate what is happening in the box.3
Why do we care about mechanisms?
If a program works, does it matter why? The answer to this question depends on whether we just want to render an evaluative judgement or whether we are on a specific mission to improve it, adapt it to another specific context, or scale it. If you aren’t interested in any of these things, then by all means, ignore mechanisms.
Even small adjustments to the program often send you searching for evidence about mechanisms. Suppose you are evaluating a program with many parts. Between funding cycles, the program administrators want to propose scaling up certain parts of the program and pruning back other parts – they don’t even want to eliminate any parts of the program. They ask you whether this is a good idea or whether this will break the program. If you don’t have any evidence about the mechanisms of the program, you won’t know the answer.
In my own evaluations, I’ve sometimes found that participants are seeing the greatest impact from aspects of the program that the creators didn’t think were critical components and that the centerpiece of the treatment wasn’t actually doing anything at all. This is more common than you might think.
Carol Weiss’s Future
I have a soft spot for any writer who boldly titles a section of a journal article “The Future,” as Carol Weiss did. She believed that theory-driven evaluation would improve not only evaluation, but also program design, since sitting down with program administrators and eliciting detailed theories would help them “confront the leaps of faith and questionable reasoning that are often involved” (p.51). She really believed in the science part of social science.
Reading what Dr. Weiss felt were the challenges facing theory-driven evaluation in the late 1990s brings up mixed feelings for me. On the one hand, she knew that the need to study mechanisms in greater detail would pose major challenges for measurement and particularly for the structural analysis of mediation. The advances that she hoped for in this area have indeed come to pass – the canonical text she cited on mediation in her article (Baron and Kenny) has now been surpassed, and structural equation modeling has made huge leaps. On the other hand, I can only sigh as I read her plea for better program theories:
“Evaluators are currently making do with the assumptions that they are able to elicit from program planners and practitioners or with the logical reasoning that they bring to the table. Many of these theories are elementary, simplistic, partial, or even outright wrong. Evaluators need to look to the social sciences, including social psychology, economics, and organization studies, for clues to more valid formulations, and they have to become better versed in theory development themselves. Better theories are important to evaluators as the backbone for their studies. Better theories are even more essential for program designers, so that social interventions have a greater likelihood of achieving the kind of society we hope for in the twenty-first century.” (p.51)
Judging by the tone of this critique, I doubt that Carol Weiss would be impressed by the slow progress in this area. I share her high standards for the discipline, and I am not impressed either.
She imagined two parallel paths forward: first, the application of theory-driven evaluation to individual programs, and second, the use of meta-analysis to test macro-theoretical claims across programs.
Weiss concludes her article:
“As a starting point, we need plausible theories. We need to make the maximum use of logical reasoning, practitioner wisdom, prior evaluations, and social science research to generate program theories and then use our collective evaluation work to test them under realistic operating conditions.” (p.53)
One of my favorite things about theory-driven evaluation is that, in its vocabulary, theory is not opposed to practice – the constraints of actual programs don’t prohibit us from formalizing models about what is going on. People who diametrically oppose “academic” epistemology to the “pragmatic” epistemology of evaluation always end up showing that they misunderstand the theory-practice relationship. We need something like a theory, even if it is a less-specific one, in order to generate any kind of testable hypothesis about the evaluand. There is no escaping program theory – the logic and wisdom of good design – only ignoring it.
Criteria for program theories
As I’ve explained before, evaluation is engaged in a progressive dialectic with decisions, such that all decisions require evaluation to specify the value of alternatives and all evaluations require decisions to specify the evaluation procedure. This evaluation-decision dialectic is what allows us to take the next logical step from Weiss’ idea of program theories, which are essentially models of evaluands, and work up towards an evaluation procedure to compare the value of different program theories, provided we are willing to make a few decisions along the way.
One way of judging the adequacy of program theories is to determine the proportion of variance in outcomes explained by their putative mechanisms. Suppose we have a structural equation model in which a full set of 6 potential mediators explains 30% of the variance in outcomes, and we then fit a nested model containing only the strongest 3 of these mediators, which explains 29% of the variance. The typical conclusion is that the 3 weaker mediators are not genuine mechanisms, since together they add only one percentage point of explained variance. Using such structural models, we can compare multiple program theories to determine the best ones.4
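To make the arithmetic of that comparison concrete, here is a minimal sketch in Python. It takes the 30% and 29% figures from the example above and computes how much variance the dropped mediators add. The F-change statistic is the standard test for nested OLS regressions, used here only as a rough stand-in for a proper SEM fit comparison. The sample size of 500 and the function name are my own assumptions for illustration.

```python
def nested_comparison(r2_full, r2_reduced, k_full, k_reduced, n):
    """Compare a full mediator model with a nested subset of its mediators.

    r2_full, r2_reduced: proportion of outcome variance explained by each model.
    k_full, k_reduced:   number of mediators in each model.
    n:                   sample size (hypothetical in the example below).
    """
    # Variance explained only by the mediators that were dropped.
    delta_r2 = r2_full - r2_reduced
    # Share of the full model's explained variance that the subset retains.
    retained = r2_reduced / r2_full
    # F-change for nested OLS regressions; an assumption standing in for
    # an SEM-specific comparison such as a chi-square difference test.
    df_num = k_full - k_reduced
    df_den = n - k_full - 1
    f_change = (delta_r2 / df_num) / ((1.0 - r2_full) / df_den)
    return {"delta_r2": delta_r2, "retained": retained, "f_change": f_change}


# Figures from the example in the text; n=500 is an illustrative assumption.
result = nested_comparison(r2_full=0.30, r2_reduced=0.29,
                           k_full=6, k_reduced=3, n=500)
print(result)
```

In this sketch the three-mediator model retains nearly all of the explained variance. Whether the small increment from the weaker mediators is worth keeping is a judgment that also depends on sample size and on how the model was specified.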
The above is only a small step away from what Carol Weiss contemplated and would not have surprised her at all. The next step, however, might. We can make this evaluation procedure symmetric by specifying our standards for a good program theory. For example, depending on the domain of the program, we might say that a program theory is adequate if its mechanisms can explain 10-20% of the variance in outcomes, poor if it explains less than 10%, and good if it explains more than 20% of the variance in outcomes. Before we specified these standards, we could only make relative claims about ordinal relations between program theories (that some are better than others) but once we have standards we can make consequential claims about distances between theories and the thresholds (that some theories are adequate and others are far from adequate).5
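The move from relative to consequential claims can be sketched in a few lines of Python. The thresholds are the illustrative 10% and 20% cut-offs from the paragraph above. The three theories and their variance-explained values are hypothetical, as are the function and variable names.

```python
POOR_BELOW = 0.10  # from the text: under 10% of variance explained is "poor"
GOOD_ABOVE = 0.20  # from the text: over 20% of variance explained is "good"


def rate_program_theory(r2):
    """Return (rating, distance from the adequacy threshold) for a theory.

    r2 is the proportion of outcome variance explained by the theory's
    mechanisms. A negative distance means the theory falls short of adequate.
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must be a proportion between 0 and 1")
    if r2 < POOR_BELOW:
        rating = "poor"
    elif r2 <= GOOD_ABOVE:
        rating = "adequate"
    else:
        rating = "good"
    return rating, r2 - POOR_BELOW


# Hypothetical program theories with made-up variance-explained values.
theories = {"theory_a": 0.04, "theory_b": 0.12, "theory_c": 0.29}

for name, r2 in sorted(theories.items(), key=lambda item: item[1]):
    rating, distance = rate_program_theory(r2)
    print(f"{name}: R2={r2:.2f} rating={rating} distance={distance:+.2f}")
```

Sorting the theories gives only the ordinal claim. The rating and the signed distance are what the standards add.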
At the same time, if we believe that standards matter, then we need to accept that a good program theory is not enough. Weiss assumes we can identify desired outcomes, but symmetric evaluation asks the more fundamental question: what standards should those outcomes meet? The legitimacy of our evaluative judgments depends critically on having defensible criteria. If I conclude that a program “works” based on arbitrary benchmarks, I’ve undermined the entire enterprise, regardless of how sophisticated my understanding of causal mechanisms might be. Imagine the extreme case in which we understand perfectly how the program works but we set the standards for success too high and conclude that it doesn’t, even though there is a positive effect – the core evaluative judgment fails.
Theory-driven evaluation is most interested in opening the black box of programs to understand causal mechanisms. This can be very important, depending on our purposes. Symmetric evaluation also wants to open the black box of evaluative judgment to understand how we move from performance data to normative conclusions. Both are addressing opacity, but in different parts of the evaluation process.
Elsewhere, I defined mechanisms as “those causal processes without which the outcome would not occur (at a given magnitude) under exchangeable conditions.”
This distinction is derived from Suchman’s earlier (1967) distinction between implementation failure (the program wasn’t carried out as planned) and theory failure (the program was carried out as planned, but the underlying theory about how change happens was flawed).
Scriven has argued that black box evaluations are a legitimate kind of evaluation, and I agree. Consider product evaluations of cars: if we find that the 2025 Lexus RC performs worse than the 2025 BMW 3-Series on all our criteria, do we need to open the hood and figure out exactly what’s wrong with the RC in order for the evaluation to be a good one? What we need most are clear standards, rigorously applied.
If we move away from frequentist SEM towards Bayesian SEM, we get even more benefits here: posterior probability distributions for path coefficients, model comparison via direct probability statements, model averaging, iterative model development, and so forth.
Similarly, we can set standards not just for final outcomes, but for each theorized link in the causal chain. There is no problem with doing this in a symmetric evaluation, since we can simply set benchmarks for “process” variables as well. From this perspective, the distinction between “process” and “outcomes” in classic evaluation theory (e.g., CIPP) is not very important, since we need the same procedures to render evaluative judgments about both. This is a flatter, monistic view that reminds us that intermediate states are not ontologically different from the final states we have decided are crucial from a program theory perspective. The outcome of one process can become the intermediate process of another depending on our theory of change; the process versus outcome distinction is a theoretical rather than material one.


