Evaluation Methods Quarterly is a blog series by Dr. Anthony Clairmont, EVALCORP’s Lead Methodologist. Each quarter, Dr. Clairmont explores key methodological concepts, challenges, and innovations in evaluation, offering insights to strengthen practice and advance the field.
There are several truly difficult aspects of evaluation that demand years of patient study to approach with any confidence: valid measurement, causal attribution, and the use of information are among these. I am currently persuaded, however, that the hardest problem in evaluation is setting standards.
To explain why this is the case, I’ll begin by explaining criteria and standards. Criteria are the dimensions of value that determine where we look when we conduct an evaluation. For example, a program to bring bicycle lanes to a city might include improved commute times as a criterion of success, or it might include improved health outcomes. We need to be explicit about criteria in evaluation because there are usually many potential dimensions of value and we can only focus on the most important ones. Within each criterion, we set standards for what counts as good performance. For the bike lane example, we might say that the bike lanes are successful on the criterion of commute times if they reduce average commute times for people living in the city by five minutes or more. (There is a further distinction that we might make between standards and goals, since standards typically specify a full range of possible performances from failure to optimal, while goals are concerned with optimal performance only.)
Criteria and standards are part of the fundamental logic of evaluation, which requires that we: 1) define what we are evaluating (the evaluand), 2) set criteria and standards, 3) gather performance data, 4) compare performance to our standards, and 5) make a synthesis judgment about the merit, worth, or significance of the evaluand across all our criteria. Without standards, we cannot make the basic comparison in step 4 that is the crux of the evaluation process. In evaluation, we are only permitted to say that something is “good” if we can articulate “by what standard” it is good. This is different from our everyday judgments because evaluation requires our judgments to be systematic and defensible.
When I say that setting standards is a “hard” problem, I do not mean that it is time-intensive for evaluators or that it requires us to gather huge amounts of expensive data. To the contrary, sometimes we can set standards without using any data at all, and the actual process of setting the standards can be done in a day’s work. Rather, I mean that it is conceptually a hard problem, requiring at least elementary knowledge of several branches of philosophy – including ethics, epistemology, and logic – in order to render anything like a valid argument for the standards we have set. (Now, of course, standards are set all the time in the absence of valid arguments for them, but as a result the quality of these standards is itself inevaluable.) The evaluator’s task regarding standard-setting is to dialogue with stakeholders so that, through collective deliberation, we can arrive at valid standards.
Modes of Reasoning for Setting Standards
The major reason that setting standards is hard is because it generally requires us to engage in three different kinds of reasoning simultaneously: normative, descriptive, and predictive. Normative reasoning concerns that which should be, descriptive reasoning concerns that which is, and predictive reasoning concerns that which will be. These ways of reasoning each take as their inputs different types of data and produce inferences which are not easily synthesized. To understand why, let’s spend some time with each type of reasoning.
Normative reasoning about standards is the most important kind of reasoning, and indeed, is sufficient to reason about standards on its own. Normative reasoning takes claims about values as its input and produces as its output statements about how the world should be, which philosophers sometimes call “injunctions.” For example, there was a major debate in the early 20th century about how fast cars should be allowed to travel, given widespread auto fatalities. One popular idea was that cars should be outfitted with limiters that would cap their maximum speeds. In this example, our normative inputs are the value of human life, the convenience of fast car rides, and the freedom of motorists, while our normative outputs are the injunctions for how fast a car should be able to go. Normative arguments often hinge on the purpose of the evaluand. As another example, the Apollo spacecraft were only “good enough” if they fulfilled their intended purpose, which (post-1961) was to get to the moon. Being incrementally better than the last attempt at space flight was not good enough – either the spacecraft had to be able to plausibly make the journey or they were never going to be allowed to launch. Normative reasoning is sufficient for setting standards when we know the balance of costs and benefits that will make it worthwhile to undertake or continue a course of action, such as running a social program, and we know in advance that subpar performance has a poor payoff.
Descriptive reasoning about standards is commonly used in evaluation, particularly when we have information available about the evaluand or similar evaluands. For example, in our bicycle lane case, we could assemble a dataset of changes in commute times from cities that have implemented bike lanes. This dataset would allow us to create a distribution of changes in commute times. One standard that can always be considered is that the evaluand should fare better than average for its class – in this case, we might call the bicycle lanes successful if they reduced commute times by more than the average amount for cities adopting this policy. Notice, however, that this still requires normative reasoning, since there is nothing special about average performance. Perhaps what we really need is for commute times to be reduced by 5 minutes or more, regardless of the average – this is a simple normative judgment. Descriptive reasoning is most helpful for standard setting when it helps us reduce our uncertainty about whether our normative standards are defensible. Learning that our normative standards fall within the range of plausible values for evaluands in the same class reassures us that we aren’t building castles in the sky.
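To make this concrete, here is a minimal sketch in Python; the peer-city commute-time reductions below are invented purely for illustration, not real data.

```python
import numpy as np

# Hypothetical reductions in average commute time (minutes) observed in
# peer cities that implemented bike lanes -- invented for illustration.
reductions = np.array([1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.5, 6.0, 7.5, 8.0])

standard = 5.0  # our normatively derived standard, in minutes

print(f"Peer-city average reduction: {reductions.mean():.1f} minutes")
print(f"Observed range: {reductions.min():.1f} to {reductions.max():.1f} minutes")

# Share of peer cities that met the normative standard -- a descriptive
# check that five minutes is demanding but not castles-in-the-sky.
met = (reductions >= standard).mean() * 100
print(f"{met:.0f}% of peer cities achieved the {standard:.0f}-minute standard")
```

Note that the descriptive check does not set the standard; it only tells us whether the five-minute figure falls within the range of what comparable cities have achieved.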
Predictive reasoning about standards is less commonly used in evaluation, since it requires a special data analytic skillset – namely, the ability to create predictive models. Predictive models range in complexity from simple probability simulations that randomly draw values from a single distribution of plausible outcomes to agent-based simulations playing out scenarios of how entities with particular characteristics will behave in different environments. By gathering information about the evaluand and the conditions within which it operates, evaluators who are skilled in predictive modeling can create predictive models of the most likely outcomes for that evaluand. For example, in a few lines of code, it is easy to simulate the number of participants who will complete a social program if we stipulate: the number of applicants, the proportion who will be ineligible, the proportion who will fail, the proportion who will drop out before completion, and the variance for each of these quantities.
As an example of predictive modeling, consider a simple simulation of the number of participants who will complete a program, given the various ways that they might leave and the precision of those estimates. The simulation yields the number of participants we expect to complete the program, along with 80% and 90% intervals around the prediction.
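Here is a minimal Monte Carlo sketch in Python; all the stipulated proportions and their spreads are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_draws = 10_000      # number of simulated program cohorts
n_applicants = 200    # stipulated number of applicants

# Stipulated mean proportion lost at each stage, with the uncertainty
# (standard deviation) around each -- all values hypothetical.
loss_stages = {
    "ineligible": (0.15, 0.03),
    "fail":       (0.10, 0.02),
    "drop_out":   (0.20, 0.05),
}

completers = np.empty(n_draws)
for i in range(n_draws):
    remaining = n_applicants
    for mean, sd in loss_stages.values():
        # Draw this cohort's loss rate, clipped to a valid proportion.
        p = np.clip(rng.normal(mean, sd), 0.0, 1.0)
        remaining -= rng.binomial(remaining, p)
    completers[i] = remaining

print(f"Expected completers: {completers.mean():.0f}")
for level in (80, 90):
    tail = (100 - level) / 2
    lo, hi = np.percentile(completers, [tail, 100 - tail])
    print(f"{level}% interval: [{lo:.0f}, {hi:.0f}]")
```

The intervals are simply percentiles of the simulated distribution, so they reflect both the stipulated variance in each proportion and the chance variation in how many participants each stage removes.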
The use of predictive reasoning is similar to the use of descriptive reasoning. Predictive models cannot replace normative reasoning but they can let us know when our normatively-derived standards are realistic. Predictive models are very helpful when we are attempting to extend our reasoning through many separate steps and processes. Predictive reasoning is even more powerful than describing other evaluands in the same class, since we can introduce assumptions about the particular characteristics of the evaluand and the conditions in which it operates.
The use of any of these three forms of reasoning for setting standards, even on its own, can require considerable work. However, what is particularly interesting to me is how complex it can be to combine these three forms of reasoning. Normative reasoning is the queen on this chessboard, but descriptive and predictive reasoning are the bishops and rooks: you can win with just the queen if you like, but more powerful strategies involve using a combination of the other pieces. Suppose that we discover a normatively-derived standard is highly unlikely to be achieved according to a predictive model – how should we integrate this information into our standard-setting process? We could disregard the prediction and double down on an outcome with a low expected utility (utility of the outcome times the probability of the outcome) or we could use the prediction to change our standard. Suppose, alternatively, that we discover that, according to our prediction, the evaluand should far outperform all others in its class. Should we set the standard in line with other similar programs (which it may exceed) or set the standard in a class of its own, based solely on our predictive reasoning? The resolution to these questions will depend wholly, I argue, on salient features of the situation in which the evaluation is being conducted. Just like the chessboard, there are no winning strategies a priori, just wise moves – wise because they are contextually appropriate.
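To illustrate the first dilemma with toy numbers – the utilities and model-derived probabilities below are entirely hypothetical:

```python
# Two candidate standards; expected utility = utility of the outcome
# times the probability of the outcome. All numbers are hypothetical.
candidates = {
    "strict (5-minute reduction)":  {"utility": 100, "probability": 0.15},
    "relaxed (3-minute reduction)": {"utility": 60,  "probability": 0.70},
}

for name, c in candidates.items():
    eu = c["utility"] * c["probability"]
    print(f"{name}: expected utility = {eu:.0f}")
# strict: 15, relaxed: 42 -- on these numbers, the prediction argues for
# relaxing the standard, unless normative considerations override it.
```

On these invented numbers the relaxed standard has nearly three times the expected utility of the strict one, but nothing in the arithmetic tells us whether the normative case for the stricter standard should nonetheless prevail – that remains a contextual judgment.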
Philosophical and Ethical Concerns for Setting Standards
In a certain sense, the combination of information required is merely a practical difficulty. Much more daunting are the philosophical questions involved in standard setting. Moving from facts to values is assumed to be possible in classical program evaluation theory, but the philosophical literature on this topic is vast. Other ethical issues arise as well, particularly when we consider how standards embody implicit decisions about the evaluand, often before it has a chance to exist. In democracies, contestation of fundamental values is a feature of the system, leading to inescapable questions about the limits of pluralism for normatively-derived standards. Philosophical debates about the nature of knowledge also emerge, particularly in our characterization of uncertainty surrounding standards. While many evaluators are prepared in their training to handle uncertainty arising from random sampling (aleatory uncertainty), and a few are prepared to handle uncertainty arising from measurement error (measurement uncertainty), the uncertainty around standards does not fall into either of these categories. When we are unsure about standards, this is because of our subjective knowledge state, so-called epistemic uncertainty, which is well-treated in some branches of statistics but not commonly taught in graduate programs that prepare future evaluators. More fundamental than all this, I think, is the general philosophical wherewithal required to understand whether stakeholders are talking about description, prediction, or normativity, and how these different forms of reasoning should be formalized in our evaluation design.
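As a minimal sketch of what such a statistical treatment might look like – the elicitation parameters and the observed value below are hypothetical – we can represent our uncertainty about where the standard should sit as a distribution over thresholds:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Epistemic uncertainty: stakeholder views about the right threshold,
# summarized as a distribution centered on a 5-minute reduction with
# some spread (hypothetical values).
standard_draws = rng.normal(loc=5.0, scale=0.8, size=10_000)

observed_reduction = 4.6  # hypothetical measured performance, in minutes

# Probability that the observed performance clears the uncertain standard.
p_meets = (observed_reduction >= standard_draws).mean()
print(f"P(performance meets the standard) = {p_meets:.2f}")
```

Here the uncertainty lives in the standard itself rather than in the data, which is precisely what distinguishes epistemic uncertainty from its aleatory and measurement cousins.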
Judgment and Setting Standards
Setting standards is a hard problem not because it is procedurally complex, but because it exposes the evaluator and stakeholders to the full weight of judgment, usually before any performance data have been gathered. It requires us to make principled distinctions between what is, what could be, and what ought to be. Then, we must justify those distinctions in ways that are intelligible to others. This is not merely a technical task but an ethical one, demanding clarity of purpose, transparency of reasoning, and quite a bit of intellectual humility. If evaluation is a disciplined reckoning with the value of human and nonhuman efforts, relationships, systems, and futures, then setting standards is where that reckoning begins.