New 6-Part Blog Series - “All Judgements Are Comparisons”: The Human Foundation Beneath AI Assessment
 
For humans, as Donald Laming observed, “There is no absolute judgement. All judgements are comparisons of one thing with another.” We judge quality by relating one piece of work to others we’ve seen, flexibly adapting to shifting expectations and context.
But AI bypasses this messy world of human comparison. It applies rubrics in isolation, scoring each script without regard for the changing standards, context, or nuance that shape professional judgement. As a result, AI’s assessments risk being out of sync with what educators and society actually value.
This gap is not just academic—if unaddressed, it threatens fairness, trust, and credibility across education. That’s why ongoing calibration to a human “gold standard” is not optional—it’s urgent and essential for trustworthy assessment in the AI era.
An urgent problem
At the heart of every debate around assessment, whether about student work, creative output, or performance reviews, lies a deceptively simple truth: all judgements are comparative. Laming’s oft-quoted observation reveals that, for humans, meaning and value do not exist in a vacuum: every mark, grade, or ranking rests on comparison.
This insight isn’t just a philosophical curiosity; it is the key to understanding why trust and validity in marking are so challenging in an era of artificial intelligence. AI does not, and cannot, judge comparatively.
This is hugely problematic and must be addressed.
As AI-powered tools increasingly participate in educational assessment, a critical challenge comes into sharp focus: AI assesses very differently from humans. This difference is more than technical; it threatens the fairness, trust, and validity of assessment systems if left unaddressed.
“There is no absolute judgement. All judgements are comparisons of one thing with another.” — Donald Laming, Human Judgment: The Eye of the Beholder (2004)
Why Comparative Judgement Is Inescapable for Humans
When experienced teachers or examiners assess work, they are always, implicitly or explicitly, drawing on a store of comparisons:
- To mark a student essay as “outstanding”, an examiner cannot avoid recalling what other outstanding work looks like.
- Even the most detailed rubrics can’t prevent markers from judging relatively - anchoring their decisions against the “best”, “worst”, and “average” they have previously seen.
- Psychologists and psychometricians alike (Laming, Thurstone, and successors) have long observed that supposed “absolute” judgements are mere illusions; in reality, context heavily sways how we categorise and value.
This comparative instinct is both a strength and a liability: it allows humans to adapt, recognise nuance, and make contextual allowances - yet it can also lead to drift, inconsistency, and bias.
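To make the mechanics concrete, the sketch below shows, in Python, how a handful of pairwise “which is better?” decisions can be turned into a relative quality scale using a simple Bradley-Terry fit, one standard way of modelling Thurstone-style comparative judgement. The scripts, judge decisions, and fitting routine are invented for illustration; this is not RM Compare’s algorithm.

```python
# Illustrative only: turning pairwise "which is better?" decisions into a
# quality scale with a simple Bradley-Terry fit. All judge decisions are invented.

from collections import defaultdict
import math

# Each tuple records one hypothetical judge decision: (winner, loser).
comparisons = [
    ("script_A", "script_B"), ("script_A", "script_C"),
    ("script_B", "script_C"), ("script_A", "script_D"),
    ("script_C", "script_D"), ("script_B", "script_D"),
    ("script_B", "script_A"), ("script_D", "script_B"),  # judges sometimes disagree
]

scripts = sorted({s for pair in comparisons for s in pair})
wins = defaultdict(int)          # total wins per script
pair_counts = defaultdict(int)   # number of comparisons per unordered pair
for winner, loser in comparisons:
    wins[winner] += 1
    pair_counts[frozenset((winner, loser))] += 1

# Bradley-Terry strengths via the standard MM iteration (Hunter, 2004):
# p_i <- wins_i / sum_j [ n_ij / (p_i + p_j) ], then renormalise.
strength = {s: 1.0 for s in scripts}
for _ in range(100):
    updated = {}
    for i in scripts:
        denom = sum(
            pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
            for j in scripts
            if j != i and frozenset((i, j)) in pair_counts
        )
        updated[i] = wins[i] / denom
    total = sum(updated.values())
    strength = {s: v / total for s, v in updated.items()}

# A log scale is a common way to present comparative-judgement results.
for s in sorted(scripts, key=lambda x: -strength[x]):
    print(f"{s}: {math.log(strength[s]):+.2f}")
```

Because every estimate is defined only relative to the others, re-running the fit as new work and new judgements arrive lets the scale move with the judging community, which is exactly the adaptiveness that isolated scoring lacks.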
Why AI Does Not, and Cannot, Judge Comparatively
Enter artificial intelligence and large language models (LLMs): when these systems “mark” work, they bypass the messy world of human context. Instead, they:
- Statistically map features of the work to an abstract, encoded understanding of the rubric.
- Score each script in isolation, never referencing what came before or after, nor the real-world distributions, shifts, or surprises a human marker might encounter.
- Are hyper-consistent - but also inflexible, sometimes missing how “quality” and “standards” evolve from year to year, or across different populations.
The result? AI assessment mimics the form of human decision-making, but lacks its truly comparative, context-grounded nature.
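A toy contrast (Python, with invented numbers) makes the point: a stand-in rubric scorer that looks only at a script’s own features returns the same mark every year, while the script’s standing against its cohort moves as standards rise. The scorer and the cohorts below are assumptions for illustration only.

```python
# Illustrative only: a stand-in "rubric scorer" that sees each script in
# isolation, versus a standing that depends on the surrounding cohort.

def rubric_score(word_count: int, citations: int) -> float:
    """Toy stand-in for an AI marker: a fixed function of the script alone."""
    return word_count / 200 + citations

# The same script (800 words, 2 citations) submitted in two different years.
target = (800, 2)
cohort_2023 = [(500, 1), (650, 1), target, (700, 2)]
cohort_2024 = [(900, 3), (1100, 4), target, (950, 3)]  # the cohort has improved

for year, cohort in (("2023", cohort_2023), ("2024", cohort_2024)):
    scores = sorted((rubric_score(w, c) for w, c in cohort), reverse=True)
    target_score = rubric_score(*target)
    rank = scores.index(target_score) + 1
    print(f"{year}: rubric score {target_score:.1f} (identical both years), "
          f"rank {rank} of {len(cohort)} (shifts with the cohort)")
```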
Read the full White Paper: Beyond Human Moderation: The Case for Automated AI Validation in Educational Assessment
The Calibration Solution: Gold Standard Benchmarks
So, how can we ensure that AI assessment remains trustworthy - reflecting not only historic data, but current professional thinking?
- Gold Standard calibration emerges as the answer. By synthesising expert human consensus via adaptive comparative judgement platforms such as RM Compare, we can capture a dynamic benchmark, rooted in expert comparison.
- Regularly calibrating AI to this human “Gold Standard” means automated marking can stay in sync with evolving expectations, values, and nuances that only comparative human judgement can reveal.
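As a rough sketch of what such calibration could involve in practice, the Python example below checks rank agreement between an AI marker’s raw scores and a human consensus scale, then fits a monotone mapping onto that scale. The anchor data, the Spearman threshold, and the choice of isotonic regression are all illustrative assumptions, not a description of the RM Compare validation pipeline.

```python
# Illustrative only: one way to keep an AI marker aligned with a human
# "gold standard" scale built from comparative judgement. Invented data.

import numpy as np
from scipy.stats import spearmanr
from sklearn.isotonic import IsotonicRegression

# Hypothetical anchor scripts: raw AI scores vs. the human consensus scale
# (e.g. values from an adaptive comparative judgement session).
ai_raw = np.array([0.42, 0.55, 0.61, 0.70, 0.78, 0.83, 0.90])
human_gold = np.array([-1.8, -0.9, -0.4, 0.1, 0.7, 1.2, 2.0])

# 1. Monitor agreement: has the AI drifted away from human consensus?
rho, _ = spearmanr(ai_raw, human_gold)
print(f"Spearman rho between AI and gold standard: {rho:.2f}")
if rho < 0.8:   # the threshold is a policy choice, not a fixed rule
    print("Agreement below threshold: trigger human re-moderation.")

# 2. Re-express AI scores on the human scale with a monotone mapping,
#    so downstream results are anchored to expert consensus.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(ai_raw, human_gold)

new_scripts_ai = np.array([0.50, 0.75, 0.88])
print("Calibrated scores:", calibrator.transform(new_scripts_ai))
```

The design point is that the human consensus scale, not the AI’s raw output, defines the units in which results are reported, and the agreement check makes drift visible rather than silent.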
Measure what you treasure
The heart of the RM Compare philosophy is ensuring assessment systems can truly “measure what you treasure”—that is, capturing the qualities, standards, and values your community most cares about, as they evolve.
But if AI assesses in isolation, without continual calibration to expert human consensus, it may end up measuring only what was easy to codify or automate—not what you actually value in student work. Human-driven comparative judgement, anchored by professional judgement and adaptive consensus, lets you re-align what’s measured to what’s treasured—continually, transparently, and at scale.
This foundational divide between comparative human judgement and formulaic AI approaches frames every challenge and opportunity in contemporary assessment. Over the coming posts, we’ll explore:
- Why LLMs and humans often disagree about “quality”
- How trust can be rebuilt in automated judgement
- Concrete solutions for embedding fairness, auditability, and professional confidence at every stage
Ready to find out more? Read on.
The blog series in full
- Introduction: Blog Series Introduction: Can We Trust AI to Understand Value and Quality?
- Blog 1: Why the Discrepancy Between Human and AI Assessment Matters—and Must Be Addressed
- Blog 2: Variation in LLM Perception of Value and Quality
- Blog 3: Who is Assessing the AI that is Assessing Students?
- Blog 4: Building Trust: From “Ranks to Rulers” to On-Demand Marking
- Blog 5: Fairness in Focus: The AI Validation Layer Proof of Concept Powered by RM Compare
- Blog 6: RM Compare as the Gold Standard Validation Layer: The Research Behind Trust in AI Marking