CIPP: Context, Input, Process, Product
A tool every evaluator should know how to use (Stufflebeam)
One of the most common feelings among professional evaluators is that we always arrive on the scene too late. The program has already been conceived, staffed, and funded. The evaluator is invited when it is time to start watching the main action of the program unfold. After a while, work starts to feel like a TV detective show – at the beginning of every new project, you duck under the crime scene tape, someone offers you a black coffee, and you growl, “What do you know so far?” under the flashing lights. The most important stuff seems to have happened already and you’re just catching up.
There is a better way. Daniel Stufflebeam figured it out. Evaluation has to start earlier.
He figured it out in the 1960s while evaluating US inner-city schools and called it CIPP: context, input, process, product. In talking with evaluators, I find that they are usually aware of CIPP but that they misremember it as something much less subtle than it actually is. Often, evaluators think that CIPP describes the elements of a program or a logic model, but this isn’t correct. CIPP actually describes four different evaluations that happen as part of a larger evaluation of a single initiative, with each letter naming the evaluand of one of those sub-evaluations.
Context
When the evaluand is the context, we are performing a needs assessment. This might seem strange until you understand that needs assessment is a type of evaluation. In this step, we define the context, identify and assess the needs of the target population, find opportunities to address those needs, and determine whether the project goals align with the identified needs. This part of the process is meant to establish a rationale for having a program or product in the first place. In practice, skipping the Context Evaluation often results in dysfunctions like:
programs that look great on paper but can’t get enough participants, even though they are cheap or free
duplicative services
services that conflict with other services
products that sound cool but which no one actually wants to buy
If any of these scenarios sound familiar, then you understand the reason for doing a Context Evaluation and following its recommendations.
Input
One of the things that makes CIPP unique is the special place that it gives to the evaluation of inputs, that is, the options that we have for addressing the unmet needs identified in the Context Evaluation step. Evaluating potential choices is an area with which few practicing evaluation consultants appear to be familiar. Some consider this to be like “strategic planning”, but Input Evaluations are much more constrained to a set of options for addressing the unmet needs identified in the Context Evaluation. Perhaps the main thing that makes Input Evaluation seem strange to evaluators is that much Input Evaluation happens in the absence of data: we don’t know what program or product we are about to run, so we obviously don’t have local data. Sometimes, if we are lucky, we can get data from other implementations of similar programs, but we can’t count on that. The best frameworks to use here will usually be some combination of decision theory, meta-analysis, and rubrics (a small sketch of a weighted rubric appears after the list below). Skipping the Input Evaluation leads to problems like:
re-inventing the wheel when evidence-based interventions are already available
ignoring obvious alternative options that would be cheaper to implement with no expected loss in quality
overusing the existing skillsets of staff to solve very different problems (they know how to do group therapy? let’s try group therapy again)
having difficulty articulating the theory of change that underlies the intervention
products that are easily outmatched by competitors who did more R&D
Many consequences of jumping past the Input Evaluation fit into the hammer-in-search-of-a-nail category of problem. This is because we don’t investigate alternative means of solving problems and we fall back on what we know how to do, even if this isn’t optimal.
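To make the rubric idea concrete, here is a minimal sketch in Python of a weighted decision rubric for comparing candidate approaches during an Input Evaluation. Everything in it is a placeholder: the options, criteria, weights, and scores are hypothetical, and in a real Input Evaluation they would come out of the Context Evaluation, the evidence base (for example, meta-analytic effect sizes), and stakeholder deliberation.

```python
# Minimal sketch of a weighted decision rubric for an Input Evaluation.
# All options, criteria, weights, and scores below are hypothetical placeholders.

CRITERIA = {                              # criterion: weight (weights sum to 1.0)
    "evidence_of_effectiveness": 0.40,
    "fit_with_identified_needs": 0.30,
    "cost_to_implement": 0.20,            # scored so that higher = cheaper
    "staff_capacity_required": 0.10,      # scored so that higher = easier to staff
}

# Scores on a 1-5 rubric for each candidate approach (hypothetical).
OPTIONS = {
    "group_therapy":   {"evidence_of_effectiveness": 4, "fit_with_identified_needs": 3,
                        "cost_to_implement": 3, "staff_capacity_required": 5},
    "peer_mentoring":  {"evidence_of_effectiveness": 3, "fit_with_identified_needs": 4,
                        "cost_to_implement": 4, "staff_capacity_required": 3},
    "case_management": {"evidence_of_effectiveness": 4, "fit_with_identified_needs": 5,
                        "cost_to_implement": 2, "staff_capacity_required": 2},
}

def weighted_score(scores: dict) -> float:
    """Return the weighted sum of rubric scores for one candidate option."""
    return sum(CRITERIA[c] * scores[c] for c in CRITERIA)

if __name__ == "__main__":
    ranked = sorted(OPTIONS, key=lambda o: weighted_score(OPTIONS[o]), reverse=True)
    for option in ranked:
        print(f"{option:20s} {weighted_score(OPTIONS[option]):.2f}")
```

Running this prints the candidate approaches ranked by weighted score. The value is less in the arithmetic than in the artifact: an explicit, discussable table of options, criteria, and weights that stakeholders can argue about before anything is built.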
Process
Most evaluators are more familiar with process and product evaluation, so I’ll move more quickly through these parts of CIPP. Briefly, process evaluation monitors the project's implementation by documenting the process, providing feedback, identifying any needed adjustments, and assessing the participants' acceptance and execution of their roles. Stufflebeam’s version of process evaluation thus shares some elements with Scriven’s formative evaluation. Skipping the Process Evaluation gets us into situations like:
the program or product worked or didn’t work but we don’t know why
we don’t know how to improve the program or product
it’s hard to write about the program because we don’t have a lot of evidence about how the program actually works
we can’t check what participants say about how the program works against our own data, so we have to take what staff or participants say at face value
we don’t have enough information to do a cost breakdown analysis of the product or service to determine its main cost drivers
Product
Product evaluation is essentially outcome evaluation. In this step, we identify and assess project outcomes. We measure, interpret, and judge a project's outcomes, determining the extent to which the needs of participants were met. Stufflebeam’s criteria are merit, worth, significance, and probity. Merit and worth refer to the internal and external value of the program, where merit is sometimes glossed as “quality” and worth as “impact to society.” Significance refers to the importance of the results beyond the case at hand, for example, whether they demonstrate a treatment that could be applied elsewhere. Probity refers to integrity, honesty, and the absence of fraud, waste, and abuse.1 When we don’t do Product Evaluation, we get into situations like the following:
funders and the community keep wondering what the program has accomplished
not knowing whether the program or product is cost-effective
planners can’t rule out the possibility that the entire budget is being wasted
there is no evidentiary basis to choose between competing approaches to solving the same problem
there is no evidentiary basis to choose which problem to focus on solving given the resources
staff are judged based on adherence to procedure rather than on results
there is no incentive to innovate to achieve better results or products

Improving CIPP
I can’t really say the word “context” out loud without air quotes. This is because I think that actor-network theory (ANT) has the correct ontology for evaluation. In ANT, we say that “context” is just a convenient name for all the crucially important things that make up the parts of the network we don’t want to talk about right now. Things that get stuffed into the “context” sack often include the nonhuman actors like the local environment, infrastructure, economy, and so forth. Having a sharp divide between the “subjects” of our research and their “context” will mess up your thinking and writing about evaluation. However, I’m going to give CIPP a pass on this one because the framework is not trying to separate “context” from anything else; it’s just trying to say that we need to do a needs assessment, which of course includes our participants in the frame.
The idea of conducting a “process evaluation” that is unconnected to outcomes has never really sat well with me. Luckily, the CIPP framework is telling us to keep all four sub-evaluations connected. However, even conducting a sub-evaluation called the “process evaluation” as part of a larger evaluation often yields useless results. This is because many process evaluations are basically attempts to vacuum up evidence that implementation is happening as planned, and of how it is happening. None of this really matters, though, unless the process is also a mechanism and we can arrive at a good understanding of that mechanism. A process evaluation always looks inside the black box, but it doesn’t always understand what it sees in there. This is an important difference between CIPP and the CMO (context, mechanism, outcome) framework from realist evaluation.
Adding the two dimensions of significance and probity to the classic “merit and worth” formula seems like an optional move to me. The reason is different for each term. For one thing, it’s clear that particular evaluation results do not need to be significant beyond the local context. Some evaluands serve a very specific purpose and we don’t need to generalize them. A medical treatment for a rare disorder doesn’t have to be significant for other disorders; it just has to work. A program that fixes urban poverty in Chicago would be a miracle, and we don’t care if it doesn’t work anywhere else. Looking for “significance” is one of the things that separates the modern practice of social science from evaluation, and this is actually a helpful line to draw. Evaluation is about the actual evaluand, not generalizability to other evaluands. The inclusion of probity in outcome evaluations is much more defensible, since it sets up the evaluator to be on the lookout for fraud, waste, abuse, and dishonesty, as we obviously should be. However, I think that probity is arguably just a dimension of merit. It’s not coherent to say that a program is very high quality (meritorious) but that it is also built on a web of lies and graft.
The Wisdom of CIPP
CIPP has aged well. The main reason for this, I believe, is that it got the fundamentals right and left room for growth. Since the 1960s, we have gotten better at many of the core methods that allow us to evaluate each of the four components.
Context: The theory and practice of needs assessment has developed into its own area of study. You can now fill a bookshelf with works on needs assessments.
Input: Decision theory, complexity theory, and related disciplines are much more sophisticated than they used to be and now we have powerful computers and good software to run their models. At AEA 2024, Peter York, Geetika Pandya, and Michael Bamberger showed how they used machine learning to help stakeholders choose among different possible program designs.
Process: Computers have also vastly improved our ability to collect process data. We now have so much process data that we sometimes don’t know what to do with it. We have to create entire positions just to structure it.
Product: We have gotten much better at the statistical modeling needed to draw conclusions about outcomes. A/B testing is commonplace in most industries (except program evaluation). It is now possible to estimate effects for much more complex models than it used to be. Causal inference and Bayesian inference have made huge leaps. (A small sketch of a basic outcome comparison follows this list.)
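As an illustration of how routine this kind of outcome estimation has become, here is a minimal sketch of a simple A/B-style comparison for a Product Evaluation, using only the Python standard library. The data are simulated placeholders; a real Product Evaluation would still need to deal with sampling, measurement, and causal identification before the numbers mean anything.

```python
# Minimal sketch of an A/B-style outcome comparison for a Product Evaluation:
# difference in mean outcomes between participants and a comparison group,
# with a bootstrap confidence interval. The data below are simulated placeholders.
import random
import statistics

random.seed(42)

# Hypothetical outcome scores (e.g., a 0-100 well-being scale).
treatment = [random.gauss(62, 10) for _ in range(200)]   # program participants
control   = [random.gauss(58, 10) for _ in range(200)]   # comparison group

observed_diff = statistics.mean(treatment) - statistics.mean(control)

def bootstrap_diffs(t, c, reps=5000):
    """Resample both groups with replacement and return the sorted mean differences."""
    diffs = []
    for _ in range(reps):
        t_sample = random.choices(t, k=len(t))
        c_sample = random.choices(c, k=len(c))
        diffs.append(statistics.mean(t_sample) - statistics.mean(c_sample))
    return sorted(diffs)

diffs = bootstrap_diffs(treatment, control)
lo, hi = diffs[int(0.025 * len(diffs))], diffs[int(0.975 * len(diffs))]
print(f"Estimated effect: {observed_diff:.2f} (95% bootstrap CI {lo:.2f} to {hi:.2f})")
```

The point is not this particular estimator; it is that the computation itself is no longer the hard part, which frees the Product Evaluation to focus on design and interpretation.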
Perhaps the most important thing about CIPP, however, is the way in which it helps us to frame evaluation questions. In the past, I’ve written about how even very well-funded and high-profile evaluations can end up asking poorly-framed evaluation questions, then waste a lot of resources answering them. Following the CIPP process sets us up to ask better evaluation questions because it gets the evaluation process started at the earliest, most basic level and moves to more advanced levels:
Context: why is there a program or product in the first place?
Input: why was this approach chosen?
Process: how were things done?
Product: did it work?
These questions will be tailored for particular contexts, but as far as evaluation questions go, we could get pretty far by simply copying these.
1. Probity. Noun [mass noun], formal. The quality of having strong moral principles; honesty and decency: financial probity. Late Middle English: from Latin probitas, from probus ‘good’. Oxford Dictionary of English.