A Famous Exam Story With a Hidden Assessment Problem

The story of the Barometer Question is usually told as a joke at the expense of an inflexible examiner. But if you look at it through an assessment lens, it is really a story about design failure, construct clarity, and the importance of a strong holistic statement of quality.

In Alexander Calandra’s version, the student is asked to “show how it is possible to determine the height of a tall building with the aid of a barometer.” The examiner expects one answer: measure the pressure at the top and bottom, then use the pressure difference to calculate the height. Instead, the student offers a string of alternative solutions.

  • Tie the barometer to a rope, lower it to the street, and measure the rope.
  • Stand it in the sun and use similar triangles from the shadows.
  • Walk up the stairs, marking off barometer‑lengths along the wall.
  • Suspend it as a pendulum at street level and on the roof, and infer the height from the difference in g between the two.
  • Knock on the superintendent’s door and offer to trade the barometer for the building’s height.

Every one of these answers is, in its own way, correct. Each one gets you from barometer to building height. Yet none of them, at least initially, demonstrates what the examiner believed the question was assessing: understanding of pressure and its relationship with height in a gravitational field. The question’s wording suggests that the construct is “ability to determine the height of a building using a barometer.” The examiner’s marking expectations reveal that, in fact, the construct is “ability to apply the concept of pressure difference with altitude.” That gap is where the trouble starts.
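
For concreteness, the method the examiner had in mind rests on the hydrostatic relation ΔP ≈ ρgΔh. Here is a minimal sketch in Python; the function name and constants are illustrative, and treating air density as constant is a reasonable approximation over the height of a building.

```python
RHO_AIR = 1.225  # kg/m^3, sea-level air density (assumed constant)
G = 9.81         # m/s^2, standard gravitational acceleration

def height_from_pressure(p_street_pa: float, p_roof_pa: float) -> float:
    """Estimate building height (m) from barometer readings, in pascals,
    taken at street level and on the roof: h = (P_street - P_roof) / (rho * g)."""
    return (p_street_pa - p_roof_pa) / (RHO_AIR * G)

# A 100 Pa drop between street and roof implies roughly 8.3 m of height.
print(height_from_pressure(101_325.0, 101_225.0))  # ~8.32
```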

Construct Confusion and the “Moral Dilemma”

Calandra describes this as a moral dilemma for the examiner. By the rules of the exam, a correct answer should receive full credit. At the same time, to reward the student would be to label as “competent in physics” someone who has not yet displayed the specific knowledge the test is intended to measure. Pass or fail both look wrong.

The way out, in the story, is to give the student another attempt and explicitly insist that the answer this time must demonstrate some physics. The student then proposes timing the fall of the barometer, using the familiar kinematics equation d = ½at² to derive the height, and finally admits that he knew the expected barometric method all along but was tired of being “taught how to think” instead of being taught the actual structure of the subject.
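
The drop method, at least, is easy to make concrete. A minimal sketch, assuming negligible air resistance and a sacrificial barometer:

```python
G = 9.81  # m/s^2, standard gravitational acceleration

def height_from_fall_time(t_seconds: float) -> float:
    """Building height (m) from the time a dropped barometer takes to
    reach the street, via d = 1/2 * g * t^2 (air resistance ignored)."""
    return 0.5 * G * t_seconds ** 2

print(height_from_fall_time(3.0))  # a 3 s fall implies ~44.1 m
```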

What is this, if not an assessment design problem? The task is open enough to admit many legitimate solution paths, but the mark scheme implicitly assumes only one. The construct that matters to the examiner is never clearly surfaced in the question. When an able, creative student engages with the task on its own terms, the examiner has to retro‑fit judgement to rescue validity. At that point, the argument is no longer about whether the response is correct; it’s about whether it is evidence of the right thing.

What the Barometer Question Tells Assessors

This is not an abstract fable. It represents a common pattern. We pose apparently simple problems that are, in reality, doing double or triple duty. They are exercises in problem‑solving, in applying specific disciplinary concepts, in communicating reasoning. We then attach a mark scheme which silently privileges one intended pathway. As soon as a student takes a different but defensible route, we face the same dilemma: do we reward correctness and creativity, or do we protect the construct we had in mind?

One response is to refine the question until it tightly constrains the method. If you really wish to test knowledge of pressure, you can steer students much more precisely, either through the wording of the prompt or by stating the expected approach in advance. The trouble is that this can come at the cost of authenticity and richness. Over‑engineered questions reduce the space for genuine thinking to appear, and encourage students to reverse‑engineer what they think the examiner wants.

The deeper issue exposed by the barometer story is that our questions often carry hidden constructs. We talk about “problem‑solving” or “understanding,” but the only thing rewarded in practice is reproducing a specific method. That tension only becomes visible when a student produces something that is both clearly good and clearly off‑script.

Adaptive Comparative Judgement: A Different Response

Adaptive Comparative Judgement offers a different way to approach the problem. Instead of trying to legislate in advance what a “good” solution will look like, it starts by accepting the reality that responses will be diverse and sometimes surprising. In an ACJ task inspired by the barometer story, students might be asked to “explain how, using only a barometer, you could determine the height of a tall building, and justify your method scientifically.” Teachers using RM Compare would see pairs of student responses and make holistic judgements: which of these two better demonstrates the understanding we care about here?

This shift from “does the answer match the model solution?” to “which of these is the better piece of work?” aligns much more naturally with the complexity of tasks like the barometer question. It makes room for legitimate alternative methods, provided they genuinely evidence the construct. It also gives teachers a principled way to distinguish between answers that are merely clever and answers that show deep understanding.

The Central Role of the Holistic Statement

That phrase - “better demonstrates the understanding we care about” - is where the holistic statement becomes critical. In ACJ, the holistic statement articulates, as clearly as possible, what counts as quality in the context of this particular task. It is not a checklist of steps. It is not a mechanical mark scheme. It is a concise, shared description of the construct: for example, “sound application of relevant physical principles (such as pressure, gravitational acceleration, or kinematics) to a plausible method, explained clearly enough that another person could follow the reasoning.”

When judges compare work, they are not hunting for a single canonical technique. They are asking which response, taken as a whole, more closely embodies that statement. That is why the holistic statement is so important. It forces assessment designers to confront the question that the barometer story exposes: what exactly are we trying to value here?

If the answer is “any physically coherent method of measuring the building’s height,” then the superintendent trade answer may fall away, not because it is wrong, but because it shows no engagement with physics at all. If the answer is “creativity in scientific problem‑solving grounded in real principles,” then a well‑argued alternative method can sit comfortably near the top of the emerging rank, even if it was never foreseen when the task was set. And if the answer is, more narrowly, “ability to use pressure differences,” then the holistic statement will say so explicitly, and judges will align their comparisons accordingly.

In RM Compare, the holistic statement also functions as a kind of contract among judges. It anchors their comparative decisions, guides professional discussion, and surfaces disagreement. Because it is written as a whole description of the desired response rather than as atomised criteria, it encourages judges to consider the integrity of a student’s solution. Is the method scientifically plausible? Is the explanation clear? Does it show more than a memorised formula? Those are necessarily holistic questions. They resist being fragmented into points for “mentions barometer,” “includes equation,” “gives final height.” When many judges apply that shared holistic lens across a large set of scripts, the resulting rank order and scaled scores reflect a consensus view of quality that is both principled and adaptable.
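
The phrase “rank order and scaled scores” can itself be made concrete. RM Compare’s internal model is not described here, but ACJ systems conventionally fit a Bradley-Terry (Rasch-family) model to the pairwise outcomes. The sketch below, with hypothetical script identifiers and a simple fixed-point iteration, shows how a pile of binary judgements collapses into a single scale.

```python
import math
from collections import defaultdict

def fit_bradley_terry(judgements, n_iter=200):
    """Turn pairwise "which is better?" decisions into scaled scores.

    judgements: list of (winner, loser) script-identifier pairs.
    Returns {script: log-strength}; sorting by value gives the
    consensus rank order. Uses the classic Zermelo/MM iteration
    for the Bradley-Terry model.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(int)
    scripts = set()
    for winner, loser in judgements:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        scripts.update((winner, loser))

    strength = {s: 1.0 for s in scripts}
    for _ in range(n_iter):
        updated = {}
        for s in scripts:
            # Sum over every opponent this script was compared against.
            denom = sum(
                n / (strength[s] + strength[next(iter(pair - {s}))])
                for pair, n in pair_counts.items() if s in pair
            )
            # +0.5 pseudo-win keeps never-winning scripts off zero.
            updated[s] = (wins[s] + 0.5) / denom
        total = sum(updated.values())
        strength = {s: v * len(scripts) / total for s, v in updated.items()}

    return {s: math.log(v) for s, v in strength.items()}

# Example: four judgements over three scripts.
scores = fit_bradley_terry([("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")])
print(sorted(scores, key=scores.get, reverse=True))  # ['A', 'B', 'C']
```

The adaptive part of ACJ sits upstream of this: the system also chooses which pairs judges see next, roughly by pairing scripts whose current estimates are closest, so each comparison is as informative as possible. The sketch covers only the scoring step.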

From Discomfort to Design

Seen this way, ACJ does not bypass the dilemma that Calandra identifies; it makes it explicit, earlier, and in a more constructive form. Rather than discovering at marking time that your question contains a hidden conflict between creativity and construct, you negotiate that balance into the holistic statement up front. You decide, as a department or a faculty, how much you wish to reward unconventional reasoning, and under what conditions. You acknowledge that human judgement is central to all of this, and you give that judgement a structured, statistical framework in which to operate.

The barometer question has travelled a long way from its origins, retold and reshaped in many contexts. What has kept it alive is not the physics; it is the discomfort teachers feel when a student’s answer exposes the gap between what our questions say and what we truly want to assess. For those working with RM Compare, that discomfort is not a bug to be smoothed over by ever tighter mark schemes. It is a signal, telling us to invest in better prompts, clearer constructs, and stronger holistic statements, so that when the next “barometer question” appears in your classroom, your assessment system is ready to do it justice.