Why the Discrepancy Between Human and AI Assessment Matters—and Must Be Addressed
 
As AI-powered tools increasingly participate in educational assessment, a critical challenge comes into sharp focus: AI judges very differently from humans. This difference is more than technical; it threatens the fairness, trust, and validity of assessment systems if left unaddressed.
Humans Judge by Comparison—AI Does Not
Experienced human markers, guided by rubrics but relying heavily on memory, context, and relative judgement, compare each student’s work to others and to shifting standards. This relational approach allows humans to:
- Adapt to changing cohort quality and curriculum evolution
- Recognize creative, context-dependent excellence outside strict rubric limits
- Adjust their judgement from day to day based on the work they encounter, supporting nuanced decision-making
AI and large language models, by contrast, score each work independently. They predict marks from statistical associations with rubric criteria learned from training data, applying these “absolute” scores without reference to other items or evolving standards.
“There is no absolute judgement. All judgements are comparisons of one thing with another.” — Donald Laming, Human Judgment: The Eye of the Beholder (2004)
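To make the contrast concrete, here is a minimal, illustrative sketch in Python. The data are invented and this is not RM Compare's actual algorithm: it simply sets an "absolute" rubric score assigned to each piece independently alongside a relative scale fitted from human pairwise judgements using a simple Bradley-Terry model, the kind of statistical model that typically underpins comparative judgement.

```python
# A minimal, illustrative sketch (invented data, not RM Compare's algorithm):
# contrast "absolute" rubric scores assigned to each piece independently with
# a relative scale fitted from pairwise human judgements (Bradley-Terry model).
from collections import defaultdict

# Hypothetical rubric marks an AI might assign to four pieces of work (0-10).
absolute_scores = {"A": 7, "B": 7, "C": 5, "D": 9}

# Hypothetical pairwise judgements from human markers: (winner, loser).
comparisons = [("D", "A"), ("D", "B"), ("A", "C"), ("B", "C"),
               ("A", "B"), ("B", "A"), ("D", "C")]

def bradley_terry(comparisons, iters=200):
    """Fit Bradley-Terry strengths with a basic minorisation-maximisation update."""
    items = sorted({x for pair in comparisons for x in pair})
    strength = {i: 1.0 for i in items}
    wins = defaultdict(float)
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
    for _ in range(iters):
        new = {}
        for i in items:
            denom = sum(pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                        for j in items if j != i and pair_counts[frozenset((i, j))])
            # A small pseudo-win keeps items that never win away from exactly zero.
            new[i] = (wins[i] + 0.1) / denom
        mean = sum(new.values()) / len(items)
        strength = {i: v / mean for i, v in new.items()}  # normalise the scale
    return strength

relative_scale = bradley_terry(comparisons)
print("Absolute (rubric) scores:    ", absolute_scores)
print("Relative (comparative) scale:",
      {k: round(v, 2) for k, v in sorted(relative_scale.items())})
```

The point is not the numbers themselves but the mechanism: the rubric scores never look sideways at other work, while the comparative scale only exists because of those sideways looks.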
The Consequences of Discrepancy
The consequences of this fundamental difference are potentially serious:
- Drift in Standards: AI marks may not keep pace with evolving human judgements about quality, creating a growing gap between what professionals and communities value and what AI scores reflect.
- Loss of Context Sensitivity: AI can undervalue innovative or unconventional responses important in real-world learning but not captured by fixed rubric features.
- Reduced Stakeholder Trust: Without alignment to human comparative judgement, AI marks risk being seen as opaque, inflexible, and unfair, damaging confidence in assessment results.
- Potential for Systemic Bias: Inconsistent calibration may embed subtle errors at scale, disproportionately affecting particular learner groups.
Why Addressing This Discrepancy Is Urgent
To safeguard fairness and validity:
- Calibration to Human Gold Standards Is Non-Negotiable: AI must be regularly benchmarked and calibrated against expert human consensus built through adaptive comparative judgement. This ensures AI scores reflect current professional standards and contextual nuances (a sketch of such a calibration step follows this list).
- Preserving Human Expertise: Rather than replacing human judgement, AI should amplify it through continuous validation against evolving human consensus.
- Ensuring Transparency and Trust: Calibration processes create auditable, transparent validation layers—vital for stakeholder confidence in an increasingly automated assessment landscape.
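As promised above, here is a small sketch of a calibration step of this kind. The approach is an assumption for illustration, not RM Compare's published method, and the marks are invented: raw AI marks are mapped onto the scale implied by human comparative-judgement consensus with a simple linear fit, and the remaining gap is audited.

```python
# A minimal calibration sketch (assumed approach for illustration, invented data):
# fit a simple linear mapping from raw AI marks onto the scale implied by human
# comparative-judgement consensus, then audit the remaining gap.

# Hypothetical benchmark set: raw AI marks and human consensus measures
# for the same pieces of work.
ai_marks = [52, 61, 48, 70, 65, 58, 74, 45]
human_consensus = [55, 66, 50, 78, 70, 60, 82, 47]

def fit_linear_calibration(x, y):
    """Ordinary least squares fit: y ~ a * x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

a, b = fit_linear_calibration(ai_marks, human_consensus)
calibrated = [a * m + b for m in ai_marks]

# Audit: how far do calibrated AI marks still sit from human consensus?
mean_abs_gap = sum(abs(c - h) for c, h in zip(calibrated, human_consensus)) / len(calibrated)
print(f"Calibration: calibrated = {a:.2f} * raw + {b:.2f}")
print(f"Mean absolute gap after calibration: {mean_abs_gap:.2f} marks")
```

In practice the choice of benchmark set, mapping, and audit threshold would be governed by the validation layer discussed later in this series; the sketch only shows where calibration and auditing sit in the loop.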
Conclusion
In sum, the human-AI assessment discrepancy demands deliberate, ongoing intervention. Gold standard calibration is fundamental to ensuring AI marks are valid, fair, and trusted, making this alignment perhaps the single most important challenge for modern educational assessment systems.
The blog series
- Introduction: Blog Series Introduction: Can We Trust AI to Understand Value and Quality?
- Blog 1: Why the Discrepancy Between Human and AI Assessment Matters—and Must Be Addressed
- Blog 2: Variation in LLM Perception of Value and Quality
- Blog 3: Who is Assessing the AI that is Assessing Students?
- Blog 4: Building Trust: From “Ranks to Rulers” to On-Demand Marking
- Blog 5: Fairness in Focus: The AI Validation Layer Proof of Concept Powered by RM Compare
- Blog 6: RM Compare as the Gold Standard Validation Layer: The Research Behind Trust in AI Marking