Asking Good Evaluation Questions

Evaluation questions are an excellent focal point around which to design an evaluation, but asking these questions the right way is harder than it seems. A sophisticated analysis cannot rescue a poorly formed evaluation question. Yet evaluators, program managers, and funders routinely find themselves deep into an evaluation before realizing that the questions guiding their work were flawed from the start.

What Makes a Question an Evaluation Question?

Before asking whether an evaluation question is well-formed, we need to establish whether it’s an evaluation question at all. Not every interesting question belongs in an evaluation, and the difference matters.

Evaluation questions are distinct from research questions in a specific way: they either culminate in value statements about the evaluand (the program, policy, or intervention being evaluated), or they provide necessary supporting information that feeds directly into such value statements. The first type might look like “Did the program achieve adequate reach among the target population?” or “Was the intervention implemented with sufficient fidelity?” These questions demand judgments (inadequate, adequate, sufficient, optimal) about the program’s merit or performance. The second type might look like “What proportion of eligible families enrolled?” or “How closely did delivery match the program model?” These questions are descriptive but serve an evaluative judgment: the enrollment rate matters because it informs whether reach was adequate; fidelity data matters because it informs whether implementation was sufficient.

Research questions, by contrast, pursue understanding without necessarily rendering judgment. “How do families make decisions about program enrollment?” is a valuable research question, but it describes a phenomenon without evaluating the program. It might belong in an evaluation if it directly informs a value claim (perhaps about whether the enrollment process is appropriately designed), but its presence must be justified by that connection.

This distinction has practical consequences. When evaluation plans include questions that cannot be traced to value judgments about the evaluand, those questions consume resources without advancing the evaluation’s purpose. They may produce interesting findings, but interesting is not the same as evaluative.

One reason this matters so urgently is that evaluation questions tend to get locked in early. The typical workflow, from request for proposals, to submitted proposal, to contract, to evaluation plan, creates multiple moments where questions get formalized and ossified. A vague question in an RFP becomes a contractual obligation. A conflation in a proposal becomes the main organizing structure of a final report. By the time we have to really face up to the problem, the political and logistical costs of revision can seem insurmountable.

Building in more revision points helps, but it is not the best solution. The best solution is to get the questions right earlier, which requires knowing what “right” looks like.

Two Types of Evaluation Questions

Within the domain of legitimate evaluation questions, a further distinction matters: the difference between discovery questions and measurement questions. This distinction is not unique to evaluation (in fact, I would argue that it structures inquiry across the sciences), but it has particular importance for evaluation design.

Consider an analogy from epidemiology. When a novel pathogen emerges, researchers must first characterize it: What are its symptoms? How does it spread? What populations does it affect? These are discovery questions. Only after this characterization can researchers meaningfully ask measurement questions: What is the infection fatality rate? What proportion of cases are asymptomatic? How effective is a given intervention at reducing transmission? There are several historical examples of diseases, like syphilis, that were characterized so poorly at first blush that for many years subsequent measurement of them was basically worthless. Attempting to measure before discovering what you’re measuring leads to confusion, like counting cases before you know what symptoms define a case, or estimating fatality rates before you understand the disease’s clinical course.

The same logic applies in evaluation. Discovery must precede measurement. In my experience, this gives us two common types of evaluation questions.

1. Discovery Questions

“Discovery questions” ask what exists, what is happening, or what possibilities are present. They are open-ended by nature, and their answers are frequently inchoate in the minds of stakeholders or communities. Examples include:

What concerns do community members raise about the program?
What implementation challenges have emerged in the first year?
What recommendations do participants offer for improvement?
What unintended consequences has the program produced?

2. Measurement Questions

“Measurement questions” ask about magnitude, prevalence, or degree. They presuppose that we already know what we’re measuring and seek to quantify it. Examples include:

What proportion of participants report that the program met their needs?
How frequently do staff use the new assessment protocol?
How much did literacy scores on a particular test improve?
What is the cost per participant served?

Both types of question serve the evaluative purpose, but in different ways. Discovery questions identify the dimensions along which value judgments might be made: you cannot judge whether participant concerns are being adequately addressed until you know what those concerns are. Measurement questions provide the evidence for making those judgments: once you know the concerns, you can assess how prevalent each one is and whether the program’s response is sufficient.

Both types must ultimately connect to value claims about the evaluand. A discovery question that surfaces interesting information but doesn’t inform any judgment about the program may be worthwhile research, but it isn’t doing evaluation work.

A Typology of Poorly Formed Evaluation Questions

Granting the distinction between evaluation and research questions and the distinction between discovery and measurement, we can now catalog the ways that evaluation questions go wrong. I’ve identified and named eight recurring failure modes that I encounter in my work. Each represents a different way that evaluation questions fail, and each calls for a different fix.

1. Discovery-Measurement Conflation

The problem: The question combines discovery and measurement into a single query, making it impossible to answer coherently. The result is a question that needs to be broken into two or more simpler questions before it can be properly answered.

Example: “Which program components deliver the best return on investment?”

This question asks us to identify what the program’s components are (discovery), determine the costs and effects of each (measurement), and rank them by efficiency… all at once. If we don’t yet know how participants and staff conceptualize the program’s components – and who knows if this is the right way to conceptualize them without additional investigation – we can’t assess their relative value. The question presumes an inventory that doesn’t exist, then asks us to perform calculations on it.

The fix: Decompose the question into sequential phases. First ask, “What are the program’s distinct components or activities?” Then, “What are the costs and outcomes associated with each component?” Only then can you address, “Which components show the strongest relationship between costs and outcomes?”
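As a rough illustration of the final phase, the short Python sketch below ranks components by cost per unit of outcome once the discovery and measurement phases have produced an inventory. The component names, dollar figures, and the single "outcome units" measure are all hypothetical, and cost per outcome is only one of several ways to operationalize "the strongest relationship between costs and outcomes."

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One program component, as identified in the discovery phase."""
    name: str
    total_cost: float      # dollars spent on this component
    outcome_units: float   # e.g., participants reaching a benchmark


def cost_per_outcome(component: Component) -> float:
    """Dollars spent per unit of outcome; infinite if no outcome was observed."""
    if component.outcome_units <= 0:
        return float("inf")
    return component.total_cost / component.outcome_units


def rank_components(components: list[Component]) -> list[Component]:
    """Order components from lowest to highest cost per outcome."""
    return sorted(components, key=cost_per_outcome)


# Hypothetical figures, for illustration only.
components = [
    Component("tutoring", 40_000, 200),
    Component("mentoring", 30_000, 100),
    Component("workshops", 10_000, 80),
]

for c in rank_components(components):
    print(f"{c.name}: ${cost_per_outcome(c):,.2f} per outcome")
```

Notice that the calculation is trivial once the inventory exists. The hard work is in the first two phases, which is exactly why the original question cannot be answered in one step.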

2. Unspecified Evaluative Criteria

The problem: Key terms in the question lack the precision needed to guide inquiry, leaving evaluators uncertain about what would count as an answer.

Example: “Is the program cost-effective?”

Cost-effective compared to what alternative? Cost-effective for which outcomes? Over what time horizon? From whose perspective: the funder, the participant, society at large? This question provides no guidance on any of these points. One evaluator might compare program costs to a do-nothing alternative and focus on short-term employment gains from the funder’s perspective. Another might compare costs to an alternative intervention and examine long-term health outcomes from a societal perspective. Both could claim to have answered the question, yet their answers would be incommensurate.

The fix: Render the key terms precise. Specify the comparison (“cost-effective relative to standard case management”), the outcomes (“in terms of housing stability and employment retention”), the time horizon (“over a three-year follow-up period”), and the perspective (“from the funder’s perspective, counting only direct program expenditures”).
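Once the comparison, outcome, time horizon, and perspective are pinned down, one common way to express cost-effectiveness is an incremental cost-effectiveness ratio: the extra cost of the program divided by the extra effect, both relative to the comparison. The Python sketch below assumes that approach, which is my choice for illustration rather than the only option, and every number in it is hypothetical.

```python
def incremental_cost_effectiveness_ratio(
    cost_program: float,
    cost_comparison: float,
    effect_program: float,
    effect_comparison: float,
) -> float:
    """Extra cost per extra unit of effect, relative to the comparison.

    Costs and effects must share the same perspective, outcome definition,
    and time horizon, or the ratio is not interpretable.
    """
    extra_effect = effect_program - effect_comparison
    if extra_effect == 0:
        raise ValueError("No difference in effect; the ratio is undefined.")
    return (cost_program - cost_comparison) / extra_effect


# Hypothetical figures per 100 participants over a three-year follow-up,
# from the funder's perspective (direct program expenditures only).
ratio = incremental_cost_effectiveness_ratio(
    cost_program=900_000,      # the program being evaluated
    cost_comparison=600_000,   # standard case management
    effect_program=65,         # participants stably housed at three years
    effect_comparison=50,
)
print(f"${ratio:,.0f} per additional participant stably housed")
```

Each argument corresponds to one of the terms the question had to specify. If any of them is left vague, two evaluators can fill in different numbers and arrive at different ratios. A negative ratio needs careful interpretation, since it can mean either that the program costs less and does more or that it costs more and does less.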

3. Evaluative Overreach

The problem: The question is too ambitious for available resources, setting the evaluation up for superficial treatment of important issues.

Example: “What is the impact of the program on participants, families, communities, and the broader field?”

No single evaluation can rigorously assess impact across all these levels. Attempting to do so typically means doing none of them well.

The fix: Narrow ruthlessly or prioritize explicitly. Decide which level of impact matters most for the decisions at hand. If funders need to know about participant outcomes, focus there. If the goal is to inform field-building, focus on contributions to the broader field. A focused evaluation that answers one question well is more useful than a sprawling evaluation that gestures at many.

4. Misdirected Evaluand

The problem: The question asks about something other than the evaluand (the program, policy, or intervention being evaluated), often the population, the context, or a comparison group, without recognizing this mismatch.

Example: “What are the educational needs of low-income families in our service area?”

This is a needs assessment question, not a program evaluation question. It asks about a population rather than a program. It may be important information, but it’s not evaluation of the program’s merit, worth, or significance. No answer to this question, however thorough, will constitute a value judgment about how well the program is performing.

The fix: Determine whether the question belongs in the evaluation at all. If it’s genuinely about understanding a comparison group or establishing a baseline, reformulate it as such and address it through appropriate means (often secondary data or a separate study). If it’s not relevant to judging the program, delete it and reclaim the resources for questions that are.

5. Research Question

The problem: The question is well-formed as research but cannot be traced to any value judgment about the evaluand. It may produce interesting findings, but those findings don’t inform whether the program is good, effective, appropriate, or worth continuing.

Example: “What is the demographic composition of program participants?”

Demographics might matter, but only if they connect to an evaluative claim. If the program aims to serve a particular population, then demographic data informs whether the program achieved equitable or targeted reach. But if the question is included simply because demographic information seems like something an evaluation should have, it consumes resources without advancing judgment.

The fix: Trace the question to its evaluative purpose. Ask: what value claim about the program does this question inform? If the answer is “the program achieved adequate reach among priority populations,” then reformulate the question to make that connection explicit: “To what extent did the program reach priority populations?” If no value claim can be identified, the question doesn’t belong in the evaluation.

6. Method-Bound Framing

The problem: The question embeds assumptions about data sources or methods, constraining the evaluation design unnecessarily.

Example: “What proportion of focus group participants feel that the program serves their needs?”

Focus groups are not designed to generate representative proportions. By building “focus group” into the question, we’ve created an evaluation question that cannot be answered with the method it specifies.

The fix: Strip the method language from the question. Ask instead, “What proportion of participants feel that the program serves their needs?” Now the question is method-neutral, and evaluators can select the appropriate data collection approach.

7. Unanchored Impact Claims

The problem: The question asks about program impact or effectiveness without specifying the standard against which impact will be judged.

Example: “Did the program improve outcomes for participants?”

Compared to what? Compared to no program? Compared to whatever program participants would otherwise have received? Compared to the participants’ situation before enrollment? Without a standard, we have no basis for judgment.

The fix: Make the evaluative standard explicit. This requires determining what kind of standard is appropriate for the evaluation’s purpose. Sometimes the standard is comparative: “Did participants in the program show greater improvement than similar individuals who did not participate?” This requires counterfactual reasoning and, ideally, a comparison group. Sometimes the standard is criterion-referenced: “Did the program achieve its stated goals?” This requires clear program objectives and agreed-upon thresholds for success. Sometimes the standard is normative: “Did the program meet professional or ethical standards for service delivery?” This requires articulating those standards explicitly. Sometimes the standard is economic: “Did the program generate benefits that exceed its costs?” This requires specifying whose costs and benefits count, what outcomes will be monetized, and over what time horizon. Each type of standard leads to a different design and the question should specify which standard is in play.

8. Double-Barreled Evaluative Claims

The problem: The question asks multiple things disguised as a single query.

Example: “How well did the program engage participants and support their learning?”

Engagement and learning are distinct constructs. A program might excel at one and fail at the other. Combining them into a single question obscures this possibility and makes it difficult to know what evidence would answer the question.

The fix: Split compound questions into their component parts. Ask separately, “How well did the program engage participants?” and “How well did the program support participant learning?” Each question can then receive focused attention and appropriate methods.

A Worked Example: Fixing Evaluation Questions

To see how these failure modes interact in practice, consider a question that might appear in an actual RFP:

“Based on interviews with program participants, what are the most significant ways the program has impacted families and communities?”

This question has multiple problems. Let’s diagnose and fix them iteratively.

First pass: Method-bound framing. The phrase “based on interviews” embeds a method assumption. Interviews may or may not be the right approach, but that’s a design decision, not part of the question itself. Strip it out:

“What are the most significant ways the program has impacted families and communities?”

Second pass: Double-barreled evaluative claims. “Families and communities” are distinct units of analysis. Impact on families (household-level changes) is different from impact on communities (neighborhood-level changes). A program might affect one without affecting the other. Split the question:

“What are the most significant ways the program has impacted families?”

“What are the most significant ways the program has impacted communities?”

Third pass: Discovery-measurement conflation. Each question still asks us to identify impacts (discovery) and judge their significance (measurement) simultaneously. We can’t know which impacts are “most significant” until we know what impacts exist. Decompose into sequential questions:

“In what ways has the program impacted families?”

“Which of these impacts do stakeholders consider most significant?”

We do the same for communities.

Fourth pass: Unanchored impact claims. The word “impacted” implies causation, but the question doesn’t specify a comparison. Are we comparing to families who didn’t participate? To families’ situations before the program? To what the program intended to achieve? Either make the comparison explicit or, if no comparison group is available, reframe the question around the changes that people attribute to the program:

“What changes have families experienced that they attribute to the program?”

“Which of these changes do stakeholders consider most significant?”

This formulation is more modest since it asks about attributed changes rather than proven impacts. It’s also more honest about what the evaluation can deliver without a comparison group. Stronger designs could make explicit comparisons to families that did not participate or those on a waitlist.

Final result. We started with one poorly formed question and ended with four well-formed questions (two about families, two about communities):

What changes have families experienced that they attribute to the program?
Which of these family-level changes do stakeholders consider most significant?
What changes have communities experienced that members attribute to the program?
Which of these community-level changes do stakeholders consider most significant?

This is more questions than we started with, but each one is answerable. The first and third are discovery questions that can be addressed through interviews or open-ended survey items. The second and fourth are measurement questions that can be addressed through ranking exercises or scaled survey items.
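To make the measurement step concrete, here is a minimal Python sketch of how a ranking exercise might be tallied. It assumes each stakeholder ranks the same list of changes from most to least significant and summarizes the results by mean rank. The changes and rankings shown are hypothetical, and mean rank is just one reasonable summary among several.

```python
from collections import defaultdict


def mean_ranks(rankings: list[list[str]]) -> dict[str, float]:
    """Average position of each change across respondents (1 = most significant)."""
    totals: dict[str, int] = defaultdict(int)
    counts: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        for position, change in enumerate(ranking, start=1):
            totals[change] += position
            counts[change] += 1
    return {change: totals[change] / counts[change] for change in totals}


# Hypothetical rankings from three stakeholders. The list of changes
# would come from the discovery questions, not from the evaluator.
rankings = [
    ["stable housing", "school attendance", "social ties"],
    ["stable housing", "social ties", "school attendance"],
    ["school attendance", "stable housing", "social ties"],
]

scores = mean_ranks(rankings)
for change in sorted(scores, key=scores.get):
    print(f"{change}: mean rank {scores[change]:.2f}")
```

The items being ranked are an output of the discovery questions, which is the practical reason the two kinds of question have to be asked in sequence.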

We can also verify that these questions connect to value judgments about the evaluand. Questions 1 and 3 provide the supporting information needed for judgment: they identify what changes occurred. Questions 2 and 4 move toward evaluation by asking which changes matter most. The implicit value claim is something like: “The program is valuable because it produced meaningful benefits for families and communities.” The four questions, taken together, provide the evidence needed to test that claim.

Managing Expectations

None of this is easy in practice. Stakeholders arrive with questions they’ve inherited from previous evaluations, funders, or board members. Timelines are set before questions are refined, and budgets assume a scope that may not match what the questions actually require.

The temptation is to try to pull a rabbit out of a hat, promising that clever methods or extra effort can overcome fundamental problems with the questions themselves. We need to resist this temptation because a poorly formed question cannot be rescued by good execution; it can only be masked temporarily until the final report reveals the incoherence.

The better path is honest conversation early. When you encounter a poorly formed evaluation question, name the problem and propose a fix. Explain what the question can and cannot support. If the timeline or budget doesn’t allow for good questions to be answered well, say so and help stakeholders choose which questions matter most.

Getting Evaluation Questions Right

Good evaluation questions share a few qualities: they’re clear enough that different evaluators would agree on what counts as an answer, bounded enough to be answerable with available resources, and they ask about the thing they claim to ask about. They distinguish between discovery and measurement and they make their evaluative standards explicit. Most fundamentally, they can be traced, directly or through supporting information, to value judgments about the evaluand.

Poorly formed questions, by contrast, fail in predictable ways. They conflate different types of inquiry, presuppose what should actually be discovered, embed method assumptions, or ask about the wrong targets entirely. Learning to recognize these failure modes is the first step toward fixing them. Finally, if you are currently in the middle of a project with poorly formed evaluation questions, it is often better to change them than to try to finish the project as-is.

Getting these questions right is not a preliminary matter to be rushed through on the way to “real” evaluation work. Rather, the questions we ask determine where resources will be expended and shape what we will learn.

Post by Anthony Clairmont, Ph.D., Wednesday, May 27, 2026
