- Opinion
Back to the Future: Rethinking Assessment in a World of Uncertainty
In 1792, revolutionaries in Paris abolished a king, Americans calmly re‑elected a president, and Cambridge quietly invented something that still shapes millions of lives every year: exam marking. While politics and industry were being rebuilt in public, assessment was being rebuilt on paper. Two centuries later, we are still living inside that decision – and only now starting to see its limits.
As we have written before, the future of assessment will be non-deterministic. This post gives that idea some historical context.
1792: When marking was invented
Traditional exam script marking at Cambridge began in 1792. Until then, examiners would convene at the end of the examination period, review each student’s scripts, and argue their way towards an agreed rank order based on professional judgement. That year, William Farish, the new Proctor of Examinations and a chemistry lecturer, proposed something radical for its time: give every answer a number, add them up, and let arithmetic decide the rank order.
It was a clever solution to real pressures. Undergraduate numbers were rising, scripts were piling up, and there were legitimate worries about personal bias and inconsistency. Marking made the work divisible, distributable and auditable. It fitted the wider 18th‑century impulse that was reshaping factories, bureaucracies and states: the drive to make complex realities legible and controllable through numbers.
Over time, marking became the default technology of assessment. It shaped the design of exams, mark schemes and even curricula, and once you have marks you can have grades, thresholds, league tables and performance targets. The clockwork of the modern exam system started to tick.
The clockwork assumption
Marking rests, often implicitly, on a broader way of seeing the world that took hold in the Enlightenment and the Industrial Revolution. Isaac Newton’s mechanics had helped persuade Europe that the universe was, at heart, orderly and law‑governed – a kind of giant mechanism whose workings could, in principle, be written down.
A little later, Pierre‑Simon Laplace pushed the idea to its logical extreme, imagining a super‑intelligence – later dubbed “Laplace’s Demon” – that, knowing every particle’s position and momentum, could predict the entire future.
As technologies like precision clocks and marine chronometers improved, they did more than help navigation and industry; they reinforced the sense that reality itself was a machine we could fully understand and control if only we had the right instruments.
John Harrison’s chronometers, accurate enough to fix a ship’s longitude at sea, became icons of this new confidence in measurement and control. Factories introduced new forms of time discipline, and work – and eventually school – became organised around the clock rather than around the slower, more irregular rhythms of tasks and seasons.
Marking fits this mindset perfectly. It treats student performances as systems that can be decomposed into components, assigned points, and reassembled into a single, definitive score. The total mark looks like a precise reading on a dial: a measure of “how good” the performance is, just as a thermometer shows the temperature.
What gets lost when we turn essays into dials
For some purposes, that is exactly what we want. Marking works well when tasks are tightly structured, the construct is clearly defined, and answers fall into relatively clear categories of right, wrong, or partially correct; multiple‑choice tests or short, constrained responses lend themselves to this treatment.
But rich human performances do not behave like clockwork mechanisms. Writing, speaking, design, practical work and creative problem‑solving are complex, context‑dependent and inherently qualitative. Two equally expert examiners may legitimately notice different strengths and weaknesses, prioritise different aspects, and arrive at different judgements – and even the same examiner may respond differently on a different day.
Yet our systems often behave as if there is a single, hidden “true mark” inside each performance, and our job is simply to reveal it accurately enough. We design detailed mark schemes, double‑marking procedures, moderation processes and statistical controls to reduce visible noise. The final grade is then presented as if it captured something solid and certain about the learner, rather than a best‑effort summary of a messy reality.
The tension shows up whenever we look closely at reliability. Analyses of high‑stakes exams, including GCSEs, suggest that only around three‑quarters of grades would be confirmed if every script were re‑marked by a senior examiner – meaning roughly one in four could change, especially in subjects involving extended writing. In GCSE English, estimates of reliability often cluster closer to 60%, implying that a significant minority of students may hold a grade that another equally qualified examiner would have judged differently. That leaves us with an uncomfortable gap between how confidently we talk about grades and how fragile some of them actually are.
Modern science quietly withdraws the promise of certainty
While assessment has been perfecting its clockwork, other parts of our understanding of the world have been moving in a different direction. Quantum mechanics, especially the uncertainty principle, shows that at a fundamental level there are limits to how precisely certain properties can be known at the same time. Complex systems science, from ecology to economics, highlights how non‑linear interactions and feedback make many outcomes inherently unpredictable. Social scientists have spent decades cataloguing the limitations of measurement when the thing being measured is human, contextual and self‑interpreting.
The upshot is not that measurement is useless, but that the old promise of total predictability and a single “view from nowhere” does not hold. We live in a world where uncertainty, ambiguity and observer‑effects are not temporary gaps in our knowledge but structural features of the systems we care about.
In that sense, our assessment culture is somewhat out of step. We continue to act as if ever finer marks and grades will eventually rid us of uncertainty, even as the rest of our intellectual life has learned to live with it.
1792 as a world of uncertainty
All of this makes 1792 an even more interesting backdrop. In Paris, Louis XVI was on his way to trial and execution, while Maximilien Robespierre and his allies were urging the new Convention to complete the revolution by founding a republic.
Across the Atlantic, George Washington was unanimously re‑elected, giving the young American republic a second term of steady leadership while Europe burned. Ordinary lives were being reshaped by forces – war, harvests, disease, political upheaval – that no one could fully predict or control.
Before 1792, most people’s experience of the world was saturated with this kind of uncertainty. Religion, myth and local custom offered ways of living with mystery rather than eliminating it. Communities made sense of events together, through stories, argument and shared judgement.
Academic assessment reflected that culture. Examiners discussed students’ work, compared performances, and argued their way towards a shared ordering of quality. The process was imperfect and influenced by bias and hierarchy, but it was fundamentally conversational and judgement‑based. It treated complex human work as something to be interpreted together, not simply as raw material to be processed into numbers.
The Industrial Revolution and the rise of marking did not create judgement, but they did marginalise it. Over time, examiners became technicians applying point schemes, rather than professionals engaged in rich dialogue about what quality looks like in their subject. Assessment drifted away from teaching and learning, into a specialist, somewhat opaque activity that happened “to” students rather than “with” them.
Back to the future with Adaptive Comparative Judgement
Adaptive Comparative Judgement (ACJ) offers a way to recover the best of that older, judgement‑centred tradition while also addressing its weaknesses. Instead of asking examiners to assign marks to individual pieces of work, ACJ asks them to do something humans are naturally good at: given two pieces of work, which one is better in relation to the construct we care about?
By presenting many such pairs to many judges, and using modern psychometric models to analyse the pattern of decisions, ACJ builds a stable, highly reliable rank order of work. Each individual judgement is qualitative and subjective; the aggregate behaves in a robust, quantifiable way. The statistics do not replace professional judgement; they depend on it and make its structure visible.
The mathematics behind this is not new. In the 1920s, psychologist Louis Thurstone showed how simple pairwise comparisons could be turned into a rigorous scale, in what he called the Law of Comparative Judgment. Modern ACJ systems build on that insight, using contemporary psychometric models to turn many human judgements into a stable rank order of work.
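To make the idea concrete, here is a minimal sketch of how many subjective pairwise decisions can be turned into a stable rank order. It uses the Bradley–Terry model, a close relative of Thurstone's formulation, fitted with a simple iterative algorithm; the essay data and win counts are invented for illustration, and real ACJ platforms such as RM Compare use more sophisticated, adaptive psychometric models.

```python
def fit_bradley_terry(wins, n_items, iters=200):
    """Estimate a 'strength' for each item from pairwise win counts.

    wins[i][j] = number of times item i was judged better than item j.
    Higher strength means the item was collectively judged better.
    Uses the standard Zermelo/Ford iterative update for the
    Bradley-Terry model.
    """
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(wins[i][j] for j in range(n_items) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_items)
                if j != i and (wins[i][j] + wins[j][i]) > 0
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [x * n_items / scale for x in new_p]  # normalise for stability
    return p


# Illustrative judging data for four essays: each pair is compared ten
# times, and judges prefer the stronger essay nine times out of ten.
# No single judgement is "correct", yet the aggregate is decisive.
wins = [[0 if i == j else (9 if i < j else 1) for j in range(4)]
        for i in range(4)]

strengths = fit_bradley_terry(wins, 4)
ranking = sorted(range(4), key=lambda i: -strengths[i])
```

Note that every input here is a fallible, qualitative decision – judges disagree one time in ten – yet the fitted strengths recover the underlying quality ordering. That is the core move of ACJ: the statistics do not overrule individual judgement, they aggregate it.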
In other words, ACJ accepts that there is no hidden, perfectly knowable “true mark” inside each essay. Instead, it models what a community of informed practitioners, looking at the same body of work, collectively judge to be better or worse. It is honest about uncertainty at the level of any single decision, and rigorous in the way it uses many decisions to build a trustworthy overall picture.
Digital platforms make this practical at a scale that would have been unthinkable in Farish’s Cambridge. Examiners can judge from anywhere; work can be sampled, re‑ordered and re‑presented adaptively to focus attention where information is most needed; and the process generates rich data about the construct, the work and the judges themselves.
Drivers for change: cracks, resistance and AI
These ideas are not new. Two centuries after Farish, Alastair Pollitt and Declan Lynch asked a disarming question: for essays and rich performances, should we stop marking altogether and return to comparative judgement? Since then, digital platforms, the shock of COVID and now generative AI have all exposed the limits of traditional marking and increased the pressure for more authentic, performance‑based assessment.
Yet large systems still cling to marks and grades. Partly this is understandable caution: accountability frameworks, progression routes and public expectations have been built around the apparent solidity of grades, and a single number feels safe, auditable and comparable over time. Partly it reflects our attachment to the clockwork story itself – the comforting sense that, if we just work hard enough on our mark schemes and statistics, we can banish uncertainty.
Recent international work, including digital education outlooks from bodies like the OECD, suggests that this stance will become harder to sustain. Faced with AI systems that can fabricate polished essays at the touch of a button, these reports argue that the future of assessment lies in human‑in‑the‑loop models that prioritise process over product, protect the social credibility of professional judgement, and use AI as a “whisperer” rather than the final arbiter. That is, in many ways, the same move “back to the future” that Adaptive Comparative Judgement represents: using powerful digital tools to make rich human comparison scalable again, instead of clinging to the illusion that a single automated score can tell us everything that matters.
A multi‑modal way forward
If we let go of the fantasy that one number can tell us everything, the question becomes: what should replace it? One answer is not to pick a single new silver bullet, but to combine different ways of seeing – to use multiple mirrors rather than a single dial.
The left mirror is holistic assessment through Adaptive Comparative Judgement in RM Compare, developed by Alastair Pollitt and Declan Lynch, giving a rich, professional view of whole performances where quality emerges from many expert comparisons rather than from a checklist of points. The right mirror is absolute, rubric‑based assessment, ensuring learners meet defined standards and competencies where clear criteria and thresholds really matter. The centre mirror is authenticity checking, using AI thoughtfully to help verify that the work genuinely reflects the learner in an age of GenAI assistance.
Used together, these modes give a much more honest, multi‑perspective picture of learning than any single mark or grade can. They embody the same shift this article has traced: away from a clockwork view of assessment as a perfect measurement machine, and towards a future where we use powerful digital tools to support expert judgement, embrace uncertainty and understand learner performance in the round.
An invitation to let go – a little – of certainty
Seen in this light, ACJ is not a quirky alternative to “proper” marking. It is a natural next step in the long story that began with Newton’s mechanics, ran through Farish’s neat arithmetic at Cambridge, and was questioned by Pollitt and Lynch’s call to “stop marking exams”. Marking was a brilliant invention for a clockwork age, and it remains a powerful tool for certain kinds of assessment, but as our understanding of the world has shifted, and as we have asked exams to capture ever richer kinds of learning, its limitations have become harder to ignore.
We are unlikely to abandon marks and grades any time soon, nor should we. They play important roles in selection, accountability and communication. But we can be more modest about what they mean, more open about the uncertainty they contain, and more imaginative about the methods we use when we care about complex performances.
Adaptive Comparative Judgement, embedded in a multi‑modal ecosystem alongside rubric‑based assessment and authenticity checking, points towards a future in which assessment is once again a deeply professional, discursive activity, supported by technology rather than constrained by it. It asks us to step away from the comforting illusion that a single number can tell us everything, and to trust instead in the collective, calibrated judgement of experts – not as a retreat from rigour, but as a more honest and humane way of doing it.