Multi-armed Bandit for Formative Evaluation
Helping programs allocate resources more effectively during evaluation
Note: In this paid post, I'm going to introduce a useful program evaluation method, explain when you would use it, work an example, and give you code you can copy to do the analysis yourself. I do paid posts for a couple of reasons: 1) certain posts are highly technical and take me a long time to write, and 2) I want to support Substack in its current form as an ad-free platform.
Formative evaluation requires continuous learning and adaptation as programs develop. Traditional experimental testing, while valuable, has a limitation: it allocates participants equally between treatment arms even as growing evidence suggests one approach is clearly superior. Multi-armed bandit (MAB) testing addresses this by dynamically shifting more participants to better-performing treatments as evidence accumulates. This makes MAB testing particularly well-suited for formative evaluation, where the goal is optimizing program effectiveness rather than generating definitive causal estimates.
What is Multi-Armed Bandit Testing?
Multi-armed bandit testing is a sequential decision-making framework that balances exploration (testing different treatments) with exploitation (using the best-known treatment). Paying close attention to the explore-exploit tradeoff is a good general strategy for doing formative evaluation because we need to be able to make decisions as the landscape rapidly changes. Rather than throw our hands up and say “it’s complicated!”, MAB gives us a set of algorithms to manage this tradeoff.
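To make the tradeoff concrete, here is a minimal sketch of epsilon-greedy, one of the simplest bandit algorithms: with probability epsilon we explore a random arm, and otherwise we exploit the arm with the best observed success rate. The function name, arm success rates, and parameter values below are hypothetical, chosen only for illustration.

```python
import random

def epsilon_greedy(true_rates, n_rounds=1000, epsilon=0.1, seed=42):
    """Epsilon-greedy bandit: explore a random arm with probability
    epsilon, otherwise exploit the arm with the best observed rate.
    true_rates are hypothetical, known only to the simulator."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon or min(pulls) == 0:
            arm = rng.randrange(n_arms)           # explore
        else:
            rates = [s / n for s, n in zip(successes, pulls)]
            arm = rates.index(max(rates))         # exploit
        reward = 1 if rng.random() < true_rates[arm] else 0
        successes[arm] += reward
        pulls[arm] += 1
    return pulls, successes

# Two hypothetical program variants with 10% and 15% success rates:
pulls, successes = epsilon_greedy([0.10, 0.15])
print(pulls)  # the better arm should accumulate most of the pulls
```

Run it a few times with different seeds and you'll see the point: the algorithm keeps sampling the weaker arm often enough to stay honest, but the large majority of participants end up in the better-performing arm.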
The key advantage for formative evaluation is ethical and practical: participants are increasingly likely to receive better treatments as the experiment progresses, while you still gather sufficient data to identify the best approach.
Multi-armed bandit algorithms have found applications across diverse fields since their early development in medical trials (Thompson, 1933). The healthcare sector has been particularly receptive to bandit approaches as part of adaptive trial designs that can modify treatment allocation based on accumulating evidence (Chow and Chang, 2008). Beyond medicine, MAB methods have proven valuable in web optimization (Li et al., 2010), economic applications including dynamic pricing strategies (Boer, 2015), and recommendation systems (Bresler et al., 2014).
Thompson's original insight remains foundational to understanding why MAB approaches are valuable for formative evaluation:
"...there can be no objection to the use of data, however meagre, as a guide to action required before more can be collected; although serious objection can otherwise be raised to argument based upon a small number of observations. Indeed, the fact that such objection can never be eliminated entirely—no matter how great the number of observations—suggested the possible value of seeking other modes of operation than that of taking a large number of observations before analysis or any attempt to direct our course. This problem is ... directly concerned with any case where probability criteria may be established by means of which we judge whether one mode of operation is better than another in some given sense or not." (Thompson, 1933)
In other words, we need to act on incomplete information while continuing to learn, and sampling uncertainty never disappears entirely, no matter how many observations we collect. Thompson is asking: what if we give up the rigor of fixed, equal allocation and let sample sizes vary across arms, so that we get to the right answer faster?
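Thompson's idea survives today as Thompson sampling: maintain a posterior over each arm's success rate, draw one plausible rate per arm, and assign the next participant to the arm whose draw is highest. Here is a minimal Beta-Bernoulli sketch, assuming binary outcomes, uniform priors, and the same hypothetical success rates as above; it is an illustration of the mechanics, not a full analysis pipeline.

```python
import random

def thompson_sampling(true_rates, n_rounds=1000, seed=42):
    """Beta-Bernoulli Thompson sampling: sample a success rate from
    each arm's Beta posterior, pull the arm with the highest draw,
    and update that arm's posterior with the observed outcome."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    alpha = [1] * n_arms  # Beta(1, 1) uniform priors
    beta = [1] * n_arms
    for _ in range(n_rounds):
        draws = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = draws.index(max(draws))
        if rng.random() < true_rates[arm]:
            alpha[arm] += 1  # success observed
        else:
            beta[arm] += 1   # failure observed
    return alpha, beta

alpha, beta = thompson_sampling([0.10, 0.15])
print([a + b - 2 for a, b in zip(alpha, beta)])  # pulls per arm
```

Notice that there is no explicit epsilon here: exploration falls out of the posterior itself. Early on, wide posteriors make every arm competitive; as evidence accumulates, draws concentrate and allocation shifts toward the better arm, which is exactly the behavior Thompson was after.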