Fairness in Focus: The AI Validation Layer Proof of Concept Powered by RM Compare

In today’s rapidly changing educational landscape, the key challenge isn’t just whether AI can mark student work, but how to ensure every mark is reliably fair. With that mission in mind, our latest proof of concept was designed to demonstrate why RM Compare is uniquely positioned as the foundation for trustworthy, scalable automated assessment.

This is the fourth in a short series of blogs exploring this important topic.

There is an accompanying White Paper (Beyond Human Moderation: The Case for Automated AI Validation in Educational Assessment) that goes into even more detail.

Why RM Compare Matters

Automated validation relies entirely on the idea of a trusted “gold standard”—a set of marks, established by expert consensus, against which AI decisions can be compared and continuously calibrated. RM Compare is the ideal platform for this purpose. Its adaptive comparative judgement (ACJ) system gathers input from multiple expert markers, using sophisticated pairing and ranking to reach a defensible, consensus-based standard for every item—whether short answer, essay, artwork, or creative task.

By drawing on this technology, our POC ensures that the reference set for AI validation is not a single opinion or static rubric, but a rigorously established expert consensus. This professional consensus ranking is the “keystone”—it anchors fairness, accuracy, and accountability for every subsequent automated decision, providing confidence for all users.
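RM Compare's own pairing and ranking algorithms are not described here, but the general idea of deriving a consensus rank from pairwise expert judgements can be illustrated with a small Bradley-Terry style sketch in Python. The judgement data, script names, and the `bradley_terry_ranks` function below are hypothetical, included only to show the shape of the technique, not RM Compare's actual method.

```python
from collections import defaultdict

# Hypothetical pairwise judgements: (winner, loser) pairs recorded when an
# expert decides one piece of work is better than another. Cycles (experts
# disagreeing) are handled naturally by the model.
judgements = [
    ("script_A", "script_B"),
    ("script_B", "script_C"),
    ("script_C", "script_D"),
    ("script_D", "script_A"),
    ("script_A", "script_C"),
    ("script_B", "script_D"),
]

def bradley_terry_ranks(pairs, iterations=200):
    """Estimate a quality score per item from pairwise wins using the
    standard Bradley-Terry fixed-point (MM) update."""
    items = {i for pair in pairs for i in pair}
    wins = defaultdict(int)
    comparisons = defaultdict(lambda: defaultdict(int))
    for winner, loser in pairs:
        wins[winner] += 1
        comparisons[winner][loser] += 1
        comparisons[loser][winner] += 1

    scores = {i: 1.0 for i in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = sum(
                n / (scores[i] + scores[j])
                for j, n in comparisons[i].items()
            )
            updated[i] = wins[i] / denom
        # Rescale so the scores stay on a comparable range between rounds.
        total = sum(updated.values())
        scores = {i: s * len(items) / total for i, s in updated.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for item, score in bradley_terry_ranks(judgements):
    print(f"{item}: {score:.2f}")
```

Running the sketch prints the four scripts ordered by their estimated quality, which is the kind of consensus ranking a gold-standard reference set is built from.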

Read the full White Paper: Beyond Human Moderation: The Case for Automated AI Validation in Educational Assessment


Our Validation Process—Powered by RM Compare

We wanted to make the process transparent and accessible, so here’s how it works at a high level:

Calibration

First, the AI marks a set of training items. These scores are compared with the RM Compare gold standard, which was generated through comparative judgement by a team of expert educators. Any discrepancies are flagged and analyzed, and the AI's parameters are adjusted. The cycle repeats until the machine's scores closely match the gold standard, as shown in the calibration diagram below.
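To make the loop concrete alongside the diagram, here is a minimal, self-contained Python sketch of that cycle. The `ToyMarker` class, the tolerance value, and the adjustment rule are all assumptions made for illustration; the real POC's model and calibration logic are not shown here.

```python
from dataclasses import dataclass

# A minimal, hypothetical sketch of the calibration cycle described above:
# mark, compare to the gold standard, flag discrepancies, adjust, repeat.

TOLERANCE = 0.5   # assumed acceptable gap between an AI mark and the consensus mark
MAX_ROUNDS = 20   # safety limit on calibration passes

@dataclass
class Item:
    item_id: str
    raw_score: float  # stand-in for whatever features the real marker uses

class ToyMarker:
    """Toy AI marker: a raw score plus a bias we can calibrate."""
    def __init__(self):
        self.bias = 3.0  # deliberately miscalibrated to start

    def mark(self, item: Item) -> float:
        return item.raw_score + self.bias

    def adjust(self, discrepancies):
        # Nudge the bias toward the average gap found in this round.
        gaps = [expert - ai for _, ai, expert in discrepancies]
        self.bias += sum(gaps) / len(gaps)

def calibrate(marker, items, gold_standard):
    for round_number in range(1, MAX_ROUNDS + 1):
        flagged = []
        for item in items:
            ai_score = marker.mark(item)
            expert_score = gold_standard[item.item_id]  # consensus reference mark
            if abs(ai_score - expert_score) > TOLERANCE:
                flagged.append((item, ai_score, expert_score))
        if not flagged:
            return round_number  # converged: AI marks track the gold standard
        marker.adjust(flagged)   # discrepancies drive the next adjustment
    return None

items = [Item("s1", 4.0), Item("s2", 6.5), Item("s3", 8.0)]
gold = {"s1": 4.5, "s2": 7.0, "s3": 8.5}
rounds = calibrate(ToyMarker(), items, gold)
print(f"Converged after {rounds} round(s)" if rounds else "Did not converge")
```

In this toy setup the marker starts with a deliberate bias, the first pass flags every script, the adjustment removes the bias, and the second pass confirms agreement with the gold standard.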

Validation in Action

With calibration achieved, each new batch of assessments follows the validation journey. If the AI’s mark aligns with the gold standard, the score is awarded automatically (successful validation). When a discrepancy surfaces, that script is routed for human review and the insights are also used to further refine the model. This ensures that no “difficult to score” item is lost or ignored—the system learns and improves each time.
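The routing decision itself is simple to express. The sketch below shows one way such a validation step might be wired; the agreement threshold, the review queue, and the retraining list are hypothetical details for illustration, not a description of the production system.

```python
# Hypothetical sketch of the validation step: award automatically when the
# AI mark agrees with the gold-standard reference, otherwise route to a human
# marker and keep the case so it can feed back into model refinement.

AGREEMENT_THRESHOLD = 0.5  # assumed maximum acceptable gap

human_review_queue = []    # scripts a human marker needs to look at
retraining_cases = []      # discrepancies fed back into calibration

def validate_batch(batch, ai_marks, reference_marks):
    """Return the marks that can be awarded automatically."""
    awarded = {}
    for script_id in batch:
        ai_score = ai_marks[script_id]
        reference = reference_marks[script_id]
        if abs(ai_score - reference) <= AGREEMENT_THRESHOLD:
            awarded[script_id] = ai_score          # successful validation
        else:
            human_review_queue.append(script_id)   # discrepancy: human decides
            retraining_cases.append((script_id, ai_score, reference))
    return awarded

batch = ["s1", "s2", "s3"]
ai_marks = {"s1": 7.0, "s2": 4.0, "s3": 9.0}
reference_marks = {"s1": 7.2, "s2": 6.0, "s3": 9.0}
print(validate_batch(batch, ai_marks, reference_marks))  # s1 and s3 awarded
print(human_review_queue)                                # ['s2'] goes to a human
```

Scripts that fall outside the threshold are never discarded: they wait for a human decision and are recorded so the next calibration pass can learn from them, mirroring the "learns and improves each time" behaviour described above.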

Demonstrating Success—Visual Proof

The test of a robust calibration is in the results. The chart below shows what a successful output looks like: blue dots are individual scripts, the red line is the AI’s calibrated scoring curve, and the proximity to the grey 1:1 line is direct evidence of fairness and fidelity to expert standards.
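For anyone who wants to reproduce that style of chart from their own calibration data, a plot along these lines takes only a few lines of matplotlib. The data below is synthetic and purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data standing in for real results: expert (gold standard) marks
# on the x-axis, calibrated AI marks on the y-axis.
rng = np.random.default_rng(0)
expert_marks = rng.uniform(0, 100, 60)
ai_marks = expert_marks + rng.normal(0, 3, 60)  # small scatter around agreement

fig, ax = plt.subplots()
ax.scatter(expert_marks, ai_marks, color="tab:blue", label="Individual scripts")

# Calibrated scoring curve: here, a simple line of best fit through the AI marks.
slope, intercept = np.polyfit(expert_marks, ai_marks, 1)
xs = np.linspace(0, 100, 100)
ax.plot(xs, slope * xs + intercept, color="red", label="AI calibrated scoring curve")

# The 1:1 line: perfect agreement with the expert gold standard.
ax.plot(xs, xs, color="grey", linestyle="--", label="1:1 agreement")

ax.set_xlabel("Gold-standard (expert consensus) mark")
ax.set_ylabel("AI mark")
ax.legend()
plt.show()
```

The closer the red fitted line sits to the grey dashed 1:1 line, the closer the AI's marks are to the expert consensus.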

What Makes RM Compare the Keystone?

  • Defensible, Representative Benchmarking: RM Compare delivers a gold standard based on collective expert judgement, not just static rules or single opinions.
  • Scalability by Design: Its ACJ process is built for consistency—whether you need a reference set for 10, 100, or 10,000+ students.
  • True Continuous Improvement: The system isn’t frozen. Any new, challenging script helps refine the gold standard, improving fairness over time.

Looking Ahead

This POC is early proof that integrating RM Compare into automated assessment provides the confidence educational AI marking needs—anchoring every automated result with a transparent, consensus-driven benchmark.

Want to learn more, contribute, or see a demonstration? We welcome educators, examiners, and policymakers to engage with us as we help shape the next era of fair and scalable assessment.