Substack seems like a good place for academics to post their conference presentations and remarks. People who attend your talk can review your presentation and see your sources. People who couldn’t attend your talk (most people?) can access and use it. Most conference presentations don’t get published anywhere, and I wonder how much knowledge we throw away every year by not publishing our conference findings.
I delivered these remarks on October 24, 2024 in Portland, OR at the annual conference of the American Evaluation Association. I wrote an additional post explaining the concept of realism about latent variables, which is probably the most philosophically challenging concept I discuss.
In this presentation, I would like to spell out some of the philosophical commitments entailed in measuring subjective experiences such as meaningfulness. Acknowledging these philosophical commitments can help us to be clearer communicators about the process of evaluation. Once I’ve laid out these commitments, it will be possible to use them to state at least one reason why we should strongly consider measuring subjective experience in evaluations.
As a bit of preamble, we need to introduce a few terms. First, I will refer to subjective experiences as “ontologically subjective” following Searle (1992). Things are ontologically subjective if their existence is dependent on consciousness (Maul, 2017). The meaningfulness of museum visits is a good example of an ontologically subjective attribute. Things that are ontologically subjective can still be ‘real’ – in the way that pain is real – which is to say that the realm of the real is composed of both ontologically objective (i.e., material) and ontologically subjective things. The second term I need to introduce is the idea of ability, severity, or construct level. This term, usually denoted by the Greek letter theta in psychometrics, refers to the construct score for a particular person that we estimate from their observed responses to a test, task, or questionnaire. Theta is usually the target of our inference in standardized testing, but sometimes in evaluation we ignore theta entirely and only talk about the summary statistics for items. With these terms on the table, we can now move on to the main argument of this presentation.
The first commitment entailed in measuring subjective experiences is realism about constructs. On this view, constructs like meaningfulness really exist and are not simply produced by our attempts to measure them. Another way of putting this, as Deborah Bandalos does, is that if the scale that measures meaningfulness were to cease to exist, meaningfulness itself would still exist (Bandalos, 2018, p. 315). While this might seem straightforward to some of us, there are some alternative ontologies. For example, some versions of logical positivism would hold that ‘meaningfulness’ is a purely theoretical term that has no direct relation to observation sentences, except through a constructed dictionary of correspondence rules (Suppe, 1977). Another anti-realist position would be instrumentalism (Toulmin, 1953), which would say that meaningfulness is just a theoretical term that evaluators keep around because it allows us to usefully predict and control our environment. Neither logical positivists nor instrumentalists hold that constructs need to be real for us to measure them or use them in evaluations – their existential import is beside the point. However, today I want to argue that realism about constructs is entailed in the idea of measuring them. There is a simple reason for this: measurement requires that the measurand be a real entity. Saying “I measured a construct, but it isn’t real” makes as much sense as saying that my imaginary friend is two meters tall. Why is this? Measurement is a special form of comparison. Minimally, it requires that the procedure for measurement be distinct from the measurand – a meter is distinct from any particular meter-long space. This means that we count at least two entities – a measuring device and a thing to be measured. In the case of my imaginary friend, we only count one entity, the meter stick. So, to measure meaningfulness, it must really exist.
The second philosophical commitment entailed in measuring subjective experiences is the requirement that the latent variable is causally responsible for variation in the indicators we have chosen to measure. If there is something called ‘meaningfulness’, then it is at least partially causally responsible for the person agreeing with the statement ‘I felt a sense of wonder or awe’ when they reflect upon their museum visit. Note that we only need the construct to be partially causally responsible for this observed response, not fully responsible. In many cases, the other, construct-irrelevant causes won’t contribute variance to the observed outcomes. When they do create variance in the observables, the model will still fit as long as the variation attributable to these other causes can be allocated to the error term or partialed out. Also note that these other construct-irrelevant causes of the statement, including the participant’s ability to read and their having attended the museum at all, aren’t just a nuisance – they are needed for measurement to take place. The requirement that the construct be causally responsible for the observed indicators is entailed in the semantics of the latent variable model. If the causal arrows run the other way, such that the indicators constitute the variable we are trying to measure, we need to apply different mathematics – in particular a different correlation matrix – and the procedure is called component analysis (Bandalos, 2018, p. 315).
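The causal direction described above – latent variable as a common cause of its indicators, with construct-irrelevant variance landing in the error term – can be made concrete with a toy simulation. This is my own illustrative sketch, not a procedure from the talk; the loadings and noise level are arbitrary assumptions.

```python
import random

def simulate_indicators(theta, loadings, noise_sd=1.0, rng=None):
    """Reflective measurement model: the latent level `theta` is a common
    cause of every indicator, while construct-irrelevant causes contribute
    the noise term. The causal arrows run latent -> observed."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    return [l * theta + rng.gauss(0.0, noise_sd) for l in loadings]

# With the construct-irrelevant noise switched off, the indicators are pure
# functions of theta, making the direction of causation explicit:
print(simulate_indicators(2.0, [0.8, 0.7, 0.9], noise_sd=0.0))
```

Because every indicator shares theta as a cause, people who differ on theta differ systematically on all indicators at once, which is exactly the covariance pattern a factor model expects.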
The third commitment entailed in measuring subjective experiences is a high degree of intersubjective similarity in the construct between persons. That is, my experience of meaningfulness in museums must be fairly similar to yours if we are in the same statistical population. This commitment is particular to the measurement of ontologically subjective attributes because the unit of measurement for these attributes is derived from comparing multiple people. In Rasch measurement, this unit of measurement is called the logit. When persons have a low level of intersubjective similarity in the construct, they respond very differently from each other and do not fit the model. This in turn causes the standard errors of the thetas for those persons to become inflated, reflecting high uncertainty about their construct levels. So, when we have low levels of intersubjective similarity in the construct between persons, this is like having a ruler made of stretchy rubber instead of wood – at certain levels of elasticity we can no longer claim to be measuring anything, even if we keep trying to hold it up to different objects.
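For readers unfamiliar with the logit as a unit, the dichotomous Rasch model can be written in a few lines. This is a standard textbook formulation, not anything specific to the talk; the example values are mine.

```python
import math

def rasch_prob(theta, difficulty):
    """Dichotomous Rasch model: probability that a person at construct
    level `theta` endorses an item located at `difficulty`. Person and
    item locations sit on one shared scale whose unit is the logit."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# A person located exactly at an item's difficulty endorses it half the time:
assert rasch_prob(0.0, 0.0) == 0.5

# Moving one logit above the item multiplies the odds of endorsement by e:
p = rasch_prob(1.0, 0.0)
print(round(p / (1 - p), 3))  # odds of endorsement ≈ 2.718
```

The shared scale is the point: because person and item locations are expressed in the same unit, the unit only means something when different people's responses are comparable enough to be placed on it together.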
A fourth commitment entailed in measuring subjective experiences is variation in the level of the construct. People in the same sample must experience different levels of meaningfulness in order for us to measure it. Once again, this philosophical commitment is particular to ontologically subjective attributes because the unit of measurement is derived from comparing multiple people. If everyone in the sample experiences the same level of meaningfulness, then they will respond in the same way on our scale. Without variation in responses, no item difficulties or person abilities can be estimated, and the model collapses. Constructs with minuscule amounts of variation also perform poorly and have severely inflated standard errors, leading once again to the rubber ruler problem.
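A loose way to see why zero variation makes estimation collapse – my illustration, not the author's procedure – is that scaling people against each other amounts to dividing by the between-person spread, and a spread of zero leaves no unit to divide by.

```python
import statistics

def person_z_scores(scores):
    """Scale raw scores against between-person variation. If everyone
    responds identically, the spread is zero and there is no unit to
    scale against -- the 'rubber ruler' collapses entirely."""
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("no between-person variation: nothing to measure against")
    mean = statistics.mean(scores)
    return [(s - mean) / sd for s in scores]

print(person_z_scores([2, 4, 6]))  # variation exists, so scaling works

try:
    person_z_scores([5, 5, 5])     # everyone identical: estimation collapses
except ValueError as err:
    print(err)
```

Latent variable estimation is considerably more involved than a z-score, but it inherits the same dependence: the unit comes from differences between people, so without differences there is no unit.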
A fifth commitment entailed in measuring subjective experiences is that no person in the population is situated at a unique value of the construct. This requirement is again specific to the measurement of ontologically subjective attributes because we need a plausible account of the source of measurement error. There is an important distinction between classical test theory and latent variable theory here. In classical test theory, measurement error comes from the “stochastic subject” interpretation, which holds that observed responses vary because of indeterminacy at the level of the individual person. In latent variable theory, measurement error comes from sampling only a small number of the subpopulation of people who possess a specific level of the construct. The classical test theory version of this assumption rests on a counterfactual story in which we repeatedly test the same person over and over and erase their memory between tests – impossible stories are not a good basis for measurement theory, so we have to do better than this. The latent variable account of the origin of variation in observed response patterns is actually plausible, but with it comes this assumption about non-uniqueness of construct scores in the population.
Let us take stock of the requirements for measuring subjective experience we have listed so far:
The construct must be real
The construct must be at least partially causally responsible for the observed responses
There must be high levels of intersubjective similarity in the construct between persons
There must be variation in the level of the construct between persons
And, there must not be unique values of the level of the construct in the population
There are some other ontological requirements that we might name, but these are the major ones entailed in measuring latent variables. In addition to these ontological requirements for measuring subjective experiences, we need to add the epistemic requirement that all of these ontological characteristics are knowable by people who are not themselves experiencing them. This means that the evaluator must be able to know that the construct is real, that it has high intersubjective similarity, and so on. Some of these knowledge claims are testable using psychometrics and some are not. There is nothing in psychometrics that allows us to test whether a construct is real – in fact, it is possible to fit a Rasch or factor model to responses to nonsense items (Maul, 2017). However, there are procedures to test intersubjective similarity of the construct that go under the name of invariance or differential item functioning. Luckily, we are not confined to using psychometric models as our only research methods, so we can use other methods to establish the truth of some of these requirements, such as using qualitative methods to establish the existence of some construct.
The crucial point here is that these are preconditions for the measurement of subjective experiences rather than the so-called assumptions that every user of statistics is accustomed to violating. If these preconditions are not met, there are no alternative procedures that can allow us to measure subjective experiences anyway. As we think about the ontologically subjective attributes we may potentially wish to measure during an evaluation, we should think through whether these conditions can really be met. I propose that three big questions can help us here:
Is this construct actually experienced in some small or large way by everyone in the focal population or did we invent it, perhaps as a matter of bureaucratic shorthand?
Are we likely to find variation in the level of the construct or just variation in the way that people talk about it or experience it?
Does this construct have the power to cause all of the things that we think are its calling cards or are some of these observed indicators actually causing it?
To close, I want to suggest that, if we can actually satisfy the preconditions for measurement of subjective experience, there is at least one good reason for us to strongly consider trying to measure it and measure it well. If an ontologically subjective construct is real, has the causal power to give rise to several observable indicators of itself, and we expect to find real and non-unique variation of it in the focal population, then this construct is at least one way in which participants differ from one another. If we think of participants as people who cluster by shared experience – as in unsupervised machine learning – rather than as heterogeneous bundles of mere characteristics that may or may not be related to the outcome, then we can start to see why this matters. We often treat demographics as though they can explain outcomes, but this is almost never the case in a causal sense. The fact that someone is Asian and a woman does not causally explain why they might enjoy museum visits more than someone who is white and a man. These demographics may be relevant for other purposes, but they lack direct explanatory power, and from a causal perspective are better considered as proxies for unmeasured subjective experience. The continued use of demographics as proxies for subjective experience is arguably partly to blame for a lack of progress in equitably transforming our cultural institutions, since such demographic comparisons do not provide even the slightest hint about how to remedy the conditions that lead to unequal participation and outcomes. People are different from each other in many ways, but we will only scratch the surface of those differences – and understand the difference that differences make – when we take up the task of measuring subjective experience.
References
Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. Guilford Publications.
Maul, A. (2017). Rethinking traditional methods of survey validation. Measurement: Interdisciplinary Research and Perspectives, 15(2), 51–69.
Searle, J. (1992). The rediscovery of the mind. Cambridge, MA: MIT Press.
Suppe, F. (Ed.). (1977). The structure of scientific theories (2nd ed.). University of Illinois Press.
Toulmin, S. (1953). The philosophy of science. London: Hutchinson.