Why Use Validated Measures? A Technical Explanation
Some facts and a simulation to illustrate them
Note: In this paid post, I’m going to introduce a tricky concept that matters for program evaluation and work through a detailed example via simulation. I do paid posts for a couple of reasons: 1) certain posts are highly technical and take me a long time to write, and 2) I want to support Substack in its current form as an ad-free platform.
Introduction
The choice between validated measurement instruments and locally constructed, bespoke tools is a fundamental methodological decision in program evaluation. Many of my evaluations use a mix of both. Validated tools offer documented psychometric properties, but applying them in new contexts often raises questions about whether they are worth using. This essay examines the empirical and theoretical advantages of validated tools, their role in enabling comparative inference, and the specific risks introduced when measurement is improvised.
What makes a validated tool?
Validated tools have undergone systematic development processes that expose and correct construct underrepresentation and construct-irrelevant variance. Factor analytic work, differential item functioning analyses, and convergent/discriminant validity studies progressively refine item sets to align with the target construct. Messick’s (1989) unified validity framework established that construct underrepresentation occurs when the assessment is too narrow and misses essential dimensions of the construct, while construct-irrelevant variance occurs when scores carry excess variance associated with other, distinct constructs.
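To make construct-irrelevant variance concrete, here is a minimal simulation sketch (this is my own illustrative code, not taken from a validation study; the loadings, sample size, and variable names are assumptions chosen for demonstration). It generates a "clean" item driven only by the target trait and a "contaminated" item that also picks up an unrelated nuisance trait, then compares how strongly each relates to the construct we actually want to measure.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Latent traits: the target construct and an unrelated nuisance construct
target = rng.normal(size=n)
nuisance = rng.normal(size=n)

# A "clean" item reflects only the target trait plus noise; a "contaminated"
# item also loads on the nuisance trait, introducing construct-irrelevant variance.
clean_item = 0.8 * target + rng.normal(scale=0.6, size=n)
contaminated_item = 0.5 * target + 0.6 * nuisance + rng.normal(scale=0.6, size=n)

print("clean item vs. target trait:    r =",
      round(np.corrcoef(clean_item, target)[0, 1], 2))
print("contaminated item vs. target:   r =",
      round(np.corrcoef(contaminated_item, target)[0, 1], 2))
print("contaminated item vs. nuisance: r =",
      round(np.corrcoef(contaminated_item, nuisance)[0, 1], 2))
```

The contaminated item still correlates with the target trait, which is exactly why such items survive casual inspection; it takes systematic analysis to notice how much of its variance belongs to something else.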
The Rasch framework, which I use, goes beyond factor analysis by providing detailed item-level and person-level diagnostics.1 Item fit statistics, particularly infit and outfit mean square values, reveal items that introduce construct-irrelevant variance by behaving unpredictably relative to the latent trait. Items with high infit (>1.3) may be measuring something else entirely for a subset of respondents, while items with low infit (<0.7) may be redundant or overly predictable. These diagnostics operate at the item level, making them more actionable than factor loadings alone. A Wright map simultaneously displays the distribution of person abilities and item difficulties on the same linear scale, revealing construct underrepresentation as gaps in the difficulty continuum where no items assess particular trait levels.
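As a rough sketch of how infit and outfit mean squares flag such items, the simulation below (again my own illustrative code, with sample sizes, difficulties, and the corrupted item chosen as assumptions) generates dichotomous responses under a Rasch model, deliberately corrupts one item so that half the respondents answer it at random, and computes residual-based fit statistics that should flag it against the rough 0.7–1.3 range mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items = 2000, 10

theta = rng.normal(size=n_persons)      # person abilities
b = np.linspace(-2, 2, n_items)         # item difficulties

# Rasch probabilities and simulated dichotomous responses
p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
x = (rng.random((n_persons, n_items)) < p).astype(float)

# Corrupt item 0: half the sample answers it at random, mimicking
# construct-irrelevant variance for that subgroup.
noisy = rng.random(n_persons) < 0.5
x[noisy, 0] = (rng.random(noisy.sum()) < 0.5).astype(float)

# Residual-based fit statistics (computed against the generating parameters
# for simplicity; in practice they come from the estimated model).
resid = x - p
var = p * (1 - p)
outfit = np.mean(resid**2 / var, axis=0)                 # unweighted mean square
infit = np.sum(resid**2, axis=0) / np.sum(var, axis=0)   # information-weighted

for i in range(n_items):
    flag = "  <-- misfit" if not (0.7 <= infit[i] <= 1.3) or outfit[i] > 1.3 else ""
    print(f"item {i}: infit={infit[i]:.2f}  outfit={outfit[i]:.2f}{flag}")
```

In a run like this, only the corrupted item should show inflated mean squares; the well-behaved items hover near 1.0, which is what makes the diagnostics actionable item by item.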
The Rasch framework distinguishes measurement (estimating person locations given calibrated items) from calibration (estimating item difficulties from a sample). This is the point of specific objectivity (see Note 1). Validated measures represent pre-calibrated item banks where difficulties are known, converting the estimation problem from joint estimation to conditional estimation given fixed item parameters. This is very useful, and it is an architectural difference, not just a statistical convenience or assumption.
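The practical payoff of a pre-calibrated bank is that measuring a person reduces to a one-dimensional estimation problem per respondent, with item difficulties treated as known constants. Here is a minimal sketch of that conditional estimation (my own illustrative code, assuming dichotomous Rasch items and a simple Newton-Raphson maximum-likelihood estimate; the bank values are made up). A joint calibration, by contrast, would have to estimate person and item parameters together from the same sample.

```python
import numpy as np

# Pre-calibrated item bank: difficulties known from prior validation work
# (these values are purely illustrative assumptions).
bank_difficulties = np.array([-1.5, -0.8, -0.2, 0.4, 1.0, 1.7])

def estimate_ability(responses, difficulties, n_iter=20):
    """Newton-Raphson MLE of one person's Rasch ability, item difficulties fixed."""
    theta = 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(theta - difficulties)))
        gradient = np.sum(responses - p)      # score function
        information = np.sum(p * (1 - p))     # Fisher information
        theta += gradient / information
    return theta

# One respondent's answers to the six bank items (1 = endorsed/correct)
responses = np.array([1, 1, 1, 0, 1, 0])
print("estimated ability:", round(estimate_ability(responses, bank_difficulties), 2))
```

Nothing about the items needs to be re-estimated here; the respondent's data only inform the respondent's location, which is what lets scores from different sites and samples land on the same scale.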
Validated tools are constantly going through this cycle of calibration and correction, with evidence for their validity accumulating in journals. Sometimes, old tools are found to be invalid for certain uses or populations and psychometricians publish about this too. It’s just as much of a splash to take down a big measure as it is to develop a new one. Ad hoc tools lack this iterative correction mechanism, making them vulnerable to a host of practical problems.
One of the ways we discover common practical problems during the validation process is by gathering qualitative data. Validated instruments typically incorporate structured expert review, cognitive interviewing, and pilot testing to ensure item comprehensibility and construct coverage. Cognitive interviewing methodology makes processes that are normally hidden available for examination, systematically revealing whether respondents interpret items as intended. Willis and Artino (2013) demonstrated that abstract terms researchers consider clear often suffer from significant misinterpretation, with items like “health professional” eliciting widely varying interpretations departing from designer intent. I find that people who work closely with participants are not necessarily any better at catching these problems with items; if anything, that closeness makes them overconfident. The remedy is to pilot the items with real participants.