New 6-Part Blog Series - “All Judgements Are Comparisons”: The Human Foundation Beneath AI Assessment
 
For humans, as Donald Laming observed, “There is no absolute judgement. All judgements are comparisons of one thing with another.” We judge quality by relating one piece of work to others we’ve seen, flexibly adapting to shifting expectations and context.
But AI bypasses this messy world of human comparison. It applies rubrics in isolation, scoring each script without regard for the changing standards, context, or nuance that shape professional judgement. As a result, AI’s assessments risk being out of sync with what educators and society actually value.
This gap is not just academic—if unaddressed, it threatens fairness, trust, and credibility across education. That’s why ongoing calibration to a human “gold standard” is not optional—it’s urgent and essential for trustworthy assessment in the AI era.
An urgent problem
At the heart of every debate around assessment, whether about student work, creative output, or performance reviews, lies a deceptively simple truth: all judgements are comparative. Laming’s oft-quoted observation reveals that, for humans, meaning and value do not exist in a vacuum: every mark, grade, or ranking rests on comparison.
This insight isn’t just a philosophical curiosity; it is the key to understanding why trust and validity in marking are so challenging in an era of artificial intelligence. AI does not, and cannot, judge comparatively.
This is hugely problematic and must be addressed.
As AI-powered tools increasingly participate in educational assessment, a critical challenge comes into sharp focus: AI assesses very differently from humans. This difference is more than technical; it threatens the fairness, trust, and validity of assessment systems if left unaddressed.
“There is no absolute judgement. All judgements are comparisons of one thing with another.” — Donald Laming, Human Judgment: The Eye of the Beholder (2004)
Why Comparative Judgement Is Inescapable for Humans
When experienced teachers or examiners assess work, they are always, implicitly or explicitly, drawing on a store of comparisons:
- To mark a student essay as “outstanding”, an examiner cannot avoid recalling what other outstanding work looks like.
- Even the most detailed rubrics can’t prevent markers from judging relatively - anchoring their decisions against the “best”, “worst”, and “average” they have previously seen.
- Psychologists and psychometricians alike (Laming, Thurstone, and successors) have long observed that supposed “absolute” judgements are mere illusions; in reality, context heavily sways how we categorise and value.
This comparative instinct is both a strength and a liability: it allows humans to adapt, recognise nuance, and make contextual allowances - yet it can also lead to drift, inconsistency, and bias.
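To make the mechanics concrete, the sketch below shows, in Python, how a handful of pairwise “which is better?” decisions can be turned into a relative quality scale using a simple Bradley-Terry fit, one standard way of modelling Thurstone-style comparative judgement. The scripts, judge decisions, and fitting routine are invented for illustration; this is not RM Compare’s algorithm.

```python
# Illustrative only: turning pairwise "which is better?" decisions into a
# quality scale with a simple Bradley-Terry fit. All judge decisions are invented.

from collections import defaultdict
import math

# Each tuple records one hypothetical judge decision: (winner, loser).
comparisons = [
    ("script_A", "script_B"), ("script_A", "script_C"),
    ("script_B", "script_C"), ("script_A", "script_D"),
    ("script_C", "script_D"), ("script_B", "script_D"),
    ("script_B", "script_A"), ("script_D", "script_B"),  # judges sometimes disagree
]

scripts = sorted({s for pair in comparisons for s in pair})
wins = defaultdict(int)          # total wins per script
pair_counts = defaultdict(int)   # number of comparisons per unordered pair
for winner, loser in comparisons:
    wins[winner] += 1
    pair_counts[frozenset((winner, loser))] += 1

# Bradley-Terry strengths via the standard MM iteration (Hunter, 2004):
# p_i <- wins_i / sum_j [ n_ij / (p_i + p_j) ], then renormalise.
strength = {s: 1.0 for s in scripts}
for _ in range(100):
    updated = {}
    for i in scripts:
        denom = sum(
            pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
            for j in scripts
            if j != i and frozenset((i, j)) in pair_counts
        )
        updated[i] = wins[i] / denom
    total = sum(updated.values())
    strength = {s: v / total for s, v in updated.items()}

# A log scale is a common way to present comparative-judgement results.
for s in sorted(scripts, key=lambda x: -strength[x]):
    print(f"{s}: {math.log(strength[s]):+.2f}")
```

Because every estimate is defined only relative to the others, re-running the fit as new work and new judgements arrive lets the scale move with the judging community, which is exactly the adaptiveness that isolated scoring lacks.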
Why AI Does Not, and Cannot, Judge Comparatively
Enter artificial intelligence and large language models (LLMs): when these systems “mark” work, they bypass the messy world of human context. Instead, they:
- Statistically map features of the work to an abstract, encoded understanding of the rubric.
- Score each script in isolation, never referencing what came before or after, nor the real-world distributions, shifts, or surprises a human marker might encounter.
- Are hyper-consistent - but also inflexible, sometimes missing how “quality” and “standards” evolve from year to year, or across different populations.
The result? AI assessment mimics the form of human decision-making, but lacks its truly comparative, context-grounded nature.
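A toy contrast (Python, with invented numbers) makes the point: a stand-in rubric scorer that looks only at a script’s own features returns the same mark every year, while the script’s standing against its cohort moves as standards rise. The scorer and the cohorts below are assumptions for illustration only.

```python
# Illustrative only: a stand-in "rubric scorer" that sees each script in
# isolation, versus a standing that depends on the surrounding cohort.

def rubric_score(word_count: int, citations: int) -> float:
    """Toy stand-in for an AI marker: a fixed function of the script alone."""
    return word_count / 200 + citations

# The same script (800 words, 2 citations) submitted in two different years.
target = (800, 2)
cohort_2023 = [(500, 1), (650, 1), target, (700, 2)]
cohort_2024 = [(900, 3), (1100, 4), target, (950, 3)]  # the cohort has improved

for year, cohort in (("2023", cohort_2023), ("2024", cohort_2024)):
    scores = sorted((rubric_score(w, c) for w, c in cohort), reverse=True)
    target_score = rubric_score(*target)
    rank = scores.index(target_score) + 1
    print(f"{year}: rubric score {target_score:.1f} (identical both years), "
          f"rank {rank} of {len(cohort)} (shifts with the cohort)")
```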
Read the full White Paper: Beyond Human Moderation: The Case for Automated AI Validation in Educational Assessment
The Calibration Solution: Gold Standard Benchmarks
So, how can we ensure that AI assessment remains trustworthy - reflecting not only historic data, but current professional thinking?
- Gold Standard calibration emerges as the answer. By synthesising expert human consensus via adaptive comparative judgement platforms such as RM Compare, we can capture a dynamic benchmark, rooted in expert comparison.
- Regularly calibrating AI to this human “Gold Standard” means automated marking can stay in sync with evolving expectations, values, and nuances that only comparative human judgement can reveal.
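As a rough sketch of what such calibration could involve in practice, the Python example below checks rank agreement between an AI marker’s raw scores and a human consensus scale, then fits a monotone mapping onto that scale. The anchor data, the Spearman threshold, and the choice of isotonic regression are all illustrative assumptions, not a description of the RM Compare validation pipeline.

```python
# Illustrative only: one way to keep an AI marker aligned with a human
# "gold standard" scale built from comparative judgement. Invented data.

import numpy as np
from scipy.stats import spearmanr
from sklearn.isotonic import IsotonicRegression

# Hypothetical anchor scripts: raw AI scores vs. the human consensus scale
# (e.g. values from an adaptive comparative judgement session).
ai_raw = np.array([0.42, 0.55, 0.61, 0.70, 0.78, 0.83, 0.90])
human_gold = np.array([-1.8, -0.9, -0.4, 0.1, 0.7, 1.2, 2.0])

# 1. Monitor agreement: has the AI drifted away from human consensus?
rho, _ = spearmanr(ai_raw, human_gold)
print(f"Spearman rho between AI and gold standard: {rho:.2f}")
if rho < 0.8:   # the threshold is a policy choice, not a fixed rule
    print("Agreement below threshold: trigger human re-moderation.")

# 2. Re-express AI scores on the human scale with a monotone mapping,
#    so downstream results are anchored to expert consensus.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(ai_raw, human_gold)

new_scripts_ai = np.array([0.50, 0.75, 0.88])
print("Calibrated scores:", calibrator.transform(new_scripts_ai))
```

The design point is that the human consensus scale, not the AI’s raw output, defines the units in which results are reported, and the agreement check makes drift visible rather than silent.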
Measure what you treasure
The heart of the RM Compare philosophy is ensuring assessment systems can truly “measure what you treasure”—that is, capturing the qualities, standards, and values your community most cares about, as they evolve.
But if AI assesses in isolation, without continual calibration to expert human consensus, it may end up measuring only what was easy to codify or automate—not what you actually value in student work. Human-driven comparative judgement, anchored by professional judgement and adaptive consensus, lets you re-align what’s measured to what’s treasured—continually, transparently, and at scale.
This foundational divide between comparative human judgement and formulaic AI approaches frames every challenge and opportunity in contemporary assessment. Over the coming posts, we’ll explore:
- Why LLMs and humans often disagree about “quality”
- How trust can be rebuilt in automated judgement
- Concrete solutions for embedding fairness, auditability, and professional confidence at every stage
Ready to find out more? Read on.
The blog series in full
- Introduction: Blog Series Introduction: Can We Trust AI to Understand Value and Quality?
- Blog 1: Why the Discrepancy Between Human and AI Assessment Matters—and Must Be Addressed
- Blog 2: Variation in LLM Perception of Value and Quality
- Blog 3: Who is Assessing the AI that is Assessing Students?
- Blog 4: Building Trust: From “Ranks to Rulers” to On-Demand Marking
- Blog 5: Fairness in Focus: The AI Validation Layer Proof of Concept Powered by RM Compare
- Blog 6: RM Compare as the Gold Standard Validation Layer: The Research Behind Trust in AI Marking