
Evaluation Methods Quarterly: Randomization Is Underused in Program Evaluation

One of my observations over my years of working with many nonprofits and government agencies is that randomization is underused in data collection, leading to increased costs and slower projects. This is particularly troubling from a historical perspective, considering that randomization is one of our oldest and most trustworthy shortcuts to getting unbiased information. To combat the underuse of randomization, organizations like J-PAL (the Abdul Latif Jameel Poverty Action Lab) and Innovations for Poverty Action (IPA) have been advocating for its increased use in program evaluations for the last two decades.

The logic and practice of randomization is well understood by practicing researchers, including many evaluators with quantitative training. And yet, when I first start working with an agency, I almost never find any examples of randomization in use for data collection. Instead, what I usually find is an all-or-nothing strategy towards data: agencies either collect data from 100% of participants 100% of the time on some variable or they don’t collect anything at all. The result is the expensive patchwork of bright spots and blind spots that characterizes the knowledge state of most organizations. Sound familiar?

In this post, I’m going to point out three scenarios in which randomization is straightforward to implement and could save an organization money in data collection. Contrary to what many people believe, the use of randomization is not limited to probability samples in public polling, which tend to be very expensive and – paradoxically – often end up being so biased that we need additional modeling to correct them. In fact, randomization is a tool that is useful across a much wider range of applications that assist our reasoning.

Scenario #1 – Sampling Within Large Programs: Some large programs we evaluate serve hundreds of participants per year and gather extensive data about all of them. This amounts to thousands of staff hours of asking questions or waiting for participants to complete self-administered questionnaires. In human services programs, staff time is often the biggest expense in the budget, so reducing the amount of staff time spent collecting data often represents a major savings (it can even cover the entire cost of conducting an evaluation). While some of the questions might need to be asked of 100% of participants for administrative purposes, like adding new participants to the system or understanding their needs, the questions asked for evaluation purposes probably don’t need to be asked of everyone. While the exact proportion will vary depending on the number of participants in the program, for statistical reasons we rarely need a sample larger than about 10% of the population to which we are trying to generalize – in this case, the population of program participants. Suppose that the program sees 1,000 participants and they complete paperwork by being asked a series of questions by program staff with iPads – we can program an algorithm to “spin” a digital roulette wheel and use a long survey for a random selection of 100 of these participants and a short survey for the remaining 900. If each person needs 10 minutes for the short survey and 25 minutes for the long survey, then randomization saves us 225 staff hours each data collection cycle, or about five and a half weeks of staff time.
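As a minimal sketch of that digital roulette wheel, here is one way the assignment could work in Python (the function name, the 10% long-survey fraction, and the fixed seed are illustrative assumptions, not a description of any particular agency’s system):

```python
import random

def assign_survey_version(participant_ids, long_fraction=0.1, seed=None):
    """Randomly route a fraction of participants to the long survey.

    Returns a dict mapping each participant ID to "long" or "short".
    """
    rng = random.Random(seed)  # seeded generator so assignments are reproducible
    ids = list(participant_ids)
    n_long = round(len(ids) * long_fraction)
    long_ids = set(rng.sample(ids, n_long))  # simple random sample, no replacement
    return {pid: ("long" if pid in long_ids else "short") for pid in ids}

# Example: 1,000 participants, of whom 100 get the long survey.
versions = assign_survey_version(range(1, 1001), long_fraction=0.1, seed=42)
print(sum(v == "long" for v in versions.values()))  # 100
```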

Scenario #2 – A/B Testing a New Idea: Managing change in organizations is challenging. There are many forces that drive program managers to take decisive action and “look confident” while doing it. One common tendency is to dedicate time to deciding how best to implement a new idea and then change 100% of operations from the old strategy to the new strategy. If we are lucky, before and after data are collected. However, other causal factors inevitably complicate our judgments about the before and after snapshots: a new crop of clients with different needs shows up, there is a staff or a rule change, the political climate shifts. We find that our before and after comparison is meaningless and that gathering these data was a waste of time and money. What could we have done instead? We could have used randomization to assign 50% of the participants (or staff, or interactions, or whatever our units are) to the new condition and kept 50% in the old condition, then gathered simple yes/no data about success rates in each condition. After running the experiment (called an A/B test) for a couple of weeks, I usually know enough to calculate a few crucial statistics: the proportion of successes in each condition (the conversion rate), the percent increase in success for the experimental condition (the lift), and the probability that the new idea is better than the old idea (estimated from the posterior distribution of each condition’s conversion rate). Now we have empirical evidence that we are on the right track. All we needed was randomization and the patience not to jump in with both feet towards our exciting new idea.
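As a sketch of those three statistics, here is one way to compute them, assuming yes/no outcome counts from each condition and a simple Beta-Binomial model with uniform priors (the helper name and the example counts are made up for illustration):

```python
import numpy as np

def ab_summary(successes_a, trials_a, successes_b, trials_b, draws=100_000, seed=0):
    """Conversion rates, lift, and P(new idea beats old) for an A/B test."""
    rng = np.random.default_rng(seed)
    rate_a = successes_a / trials_a    # conversion rate, old idea (A)
    rate_b = successes_b / trials_b    # conversion rate, new idea (B)
    lift = (rate_b - rate_a) / rate_a  # relative increase of B over A

    # Beta(1, 1) prior + binomial data -> Beta posterior for each true rate;
    # sample both posteriors and count how often B's rate exceeds A's.
    post_a = rng.beta(1 + successes_a, 1 + trials_a - successes_a, draws)
    post_b = rng.beta(1 + successes_b, 1 + trials_b - successes_b, draws)
    prob_b_better = float((post_b > post_a).mean())
    return rate_a, rate_b, lift, prob_b_better

# Example: old idea succeeds 40/200 times, new idea 56/200.
rate_a, rate_b, lift, p = ab_summary(40, 200, 56, 200)
print(f"A: {rate_a:.1%}, B: {rate_b:.1%}, lift: {lift:.0%}, P(B > A): {p:.2f}")
```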

Scenario #3 – Randomized Waitlist Control: Some programs that we evaluate are functioning at peak capacity or slightly over capacity, with more participants than there are program slots. When this is the case, program administrators often feel pressure to treat people on a first-come-first-served basis or to prioritize people with the highest needs. However, pushing the program to treat people beyond its capacity often results in a decline in the quality of services and can have long-term negative consequences like staff turnover. If we are willing to randomize and exercise patience, we can learn a lot about the program under these conditions. Once the program hits capacity, we can start randomizing new participants into treatment and waitlist conditions, regardless of other characteristics, and then gather baseline data from participants in both conditions. For example, if participants spend two months on the waitlist on average, then we can compare this waitlist control group to participants who received two months of treatment during the same period. Then, once more spots in the program open up, the participants in the waitlist control group get treatment. This way, the control group for the study does not deny treatment to participants, since everyone gets treatment eventually. In other words, using randomization, we get a free control group that serves as a statistically valid sample to compare to our treatment group.
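One simple way to operationalize that assignment, sketched in Python (the function name and counts are hypothetical; a real program would also log baseline data for both groups):

```python
import random

def assign_at_capacity(applicants, open_slots, seed=None):
    """Randomly fill the open slots and waitlist everyone else,
    ignoring arrival order and other characteristics."""
    rng = random.Random(seed)
    pool = list(applicants)
    rng.shuffle(pool)  # random order = random assignment
    return pool[:open_slots], pool[open_slots:]  # (treatment, waitlist control)

# Example: 8 new applicants arrive after the program hits capacity; 4 slots open.
treatment, waitlist = assign_at_capacity(
    ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"], open_slots=4, seed=7
)
print("treatment:", treatment)
print("waitlist control:", waitlist)
```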

These are cases where randomization would greatly benefit programs. What about the opposite case? I would be remiss if I failed to mention that there are times when we shouldn’t use randomization. We shouldn’t use it when doing so would result in the denial of critical services to a person who would otherwise be entitled to receive them. The exception to this rule is a clinical trial, in which people consent to potentially being placed in a control or placebo condition; the difference is consent. We also shouldn’t use randomization to make decisions with very different risks and payoffs. For example, we shouldn’t randomly decide which programs to fund or defund, since different programs provide very different levels of benefits to society.

But what’s so special about randomization to begin with? The most important thing about it is that it tends to produce a balanced, representative sample, including on variables about which we have no information. (Truly random samples will be representative on average, but representative samples are not necessarily random.) This balance allows for meaningful comparisons between groups without making other adjustments (although we are still free to make adjustments if they help). Second, randomization removes many potential sources of human bias. By flipping a coin, or letting a computer do the equivalent, I remove any biases I may have from the assignment of people to groups. I use randomization all the time in my work as an evaluator (in big and small ways) to make unbiased decisions between strategies that have comparable expected utility. One final thing to know about randomization is that it doesn’t inherently cost money or take a lot more time. In fact, if the techniques I just described are used well, they will tend to save the program money and time in the long run. A little planning, a little patience, and a single coin to flip go a long way.

“If you can’t measure it, you can’t improve it.”

– Peter Drucker
