What the latest research tells us about high-stakes assessment and ACJ

The recent study in Frontiers in Education, "Beyond Reliability: Examining the Applicability of Adaptive Comparative Judgment (ACJ) in High-Stakes Assessment" (2026), confirms what we at RM Compare have long championed: ACJ (using RM Compare) produces exceptionally high reliability, even in the "messy" and complex world of higher education.

The authors have also done the field a service by naming the friction points practitioners actually experience: workload, the challenge of mapping ranks to grades, and anxieties about transparency and "surface-level" judging. Those descriptions are real; where we differ is in the conclusion that they represent flaws in ACJ itself. Instead, they are the inevitable friction that occurs when you run ACJ on items built for a different form of assessment altogether.

If you use ACJ to mark a task designed for a rubric, you aren't testing the technology; you are testing its ability to mimic an entirely different system. This is a clear category error which, unsurprisingly, produces the results and conclusions described in the research.

Using oranges to test for apples is not going to offer much clarity. To move forward, we need a different starting point across six key areas.

1. The "Rubric-Task" Trap

The most significant design issue in contemporary ACJ research is Construct Contamination. Most high-stakes tasks are currently designed for "point-harvesting." They are fragmented into small, discrete questions (1a, 1b, etc.) specifically so a marker can tick boxes on a rubric.

When you force experts to compare these fragmented scripts, you force them into mental accounting. They end up looking for specific fragments rather than evaluating the quality of the work. This is what leads to the "surface-level" bias noted in the study.

The Shift: ACJ is built for holistic, integrated tasks. If a response can be easily marked by a rubric, it probably shouldn't be assessed with ACJ. We should use this technology to assess synthesis, professional flair, and "the sense of the whole."

2. The Expertise Paradox: Speed ≠ Shallow

The study suggests that short judgment times lead to shallow evaluations. This rests on the assumption that slowness is a proxy for rigor. We challenge that.

Recognition vs. Calculation: In fields like emergency medicine or elite sports, we value expert intuition and the ability to recognise complex patterns instantly. A rubric forces a marker to act like a human calculator, adding up parts to find a sum. ACJ allows the judge to function as a professional, recognising quality through comparison rather than laboriously itemising it.

Fast decisions can still be deeply informed decisions when they are made by genuine experts working with well-designed, holistic tasks. The issue is not speed per se, but whether the task and judging context allow expertise to operate in recognition mode rather than checklist mode.

3. Grading as Policy, Not Mathematics

A major friction point cited was the difficulty of mapping rank orders to grades. This is what the researchers called "post-hoc thresholding." At RM Compare, we view this as a moment of technological honesty, not a design flaw.

The Linear Ruler: We have solved the mathematical challenge of converting "True Scores" expressed in logits (a log-odds scale) into a linear 0–100 scale. This turns a "rank" into a "ruler" that educators actually understand and can work with.
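
To make this concrete, here is a minimal sketch, in Python, of what a logit-to-scale conversion can look like. The rescaling choice (anchoring the lowest and highest scripts at 0 and 100) and the script names are illustrative assumptions for this post, not a description of RM Compare's internal method.

```python
# A minimal sketch of rescaling logit-based "True Scores" onto a 0-100 ruler.
# The min-max anchoring and the script names are illustrative assumptions,
# not RM Compare's actual conversion.

def logits_to_scale(logits, low=0.0, high=100.0):
    """Linearly map logit quality estimates onto a low-to-high ruler."""
    lo, hi = min(logits), max(logits)
    if hi == lo:                                   # all scripts identical: put everything mid-scale
        return [(low + high) / 2 for _ in logits]
    span = high - low
    return [round(low + span * (x - lo) / (hi - lo), 1) for x in logits]

# Four scripts with estimated quality parameters in logits.
true_scores = {"script_A": -1.8, "script_B": -0.2, "script_C": 0.9, "script_D": 2.4}
scaled = dict(zip(true_scores, logits_to_scale(list(true_scores.values()))))
print(scaled)   # {'script_A': 0.0, 'script_B': 38.1, 'script_C': 64.3, 'script_D': 100.0}
```

The result is a scale educators can read at a glance, while the underlying comparisons still do the measurement work.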

Standard Setting: Grade conversion is a Standard Setting exercise. By choosing boundary scripts that embody a "Pass" or a "First," humans make an intentional, defensible policy decision. Critics frame this as a disadvantage: one more step, one more meeting. We see it as grade validity.

A rubric creates the illusion that grades write themselves from accumulated points, but those thresholds (for example, 50% = pass, 70% = distinction) are also policy decisions; they are just hidden inside the rubric's point structure. ACJ makes the decision explicit and auditable: "These scripts represent a Pass. Do you agree?" That is not extra work; it is honest work. ACJ doesn't hide the grading process; it makes it a conscious, expert-led act.
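
To show what explicit standard setting can look like in practice, here is a small, hypothetical sketch: the panel nominates boundary scripts, and each boundary script's scaled score becomes the cut for that grade. The grade labels, script names, and cut points are illustrative assumptions, not RM Compare defaults.

```python
# A sketch of explicit standard setting: the panel nominates boundary scripts, and each
# boundary script's scaled score becomes the threshold for that grade. Labels, names,
# and numbers are illustrative assumptions for this post.

def build_thresholds(scaled_scores, boundary_scripts):
    """boundary_scripts maps grade label -> the script the panel agreed 'just earns' it."""
    return sorted(
        ((grade, scaled_scores[script]) for grade, script in boundary_scripts.items()),
        key=lambda pair: pair[1],
        reverse=True,                              # check the highest grade first
    )

def assign_grade(score, thresholds, default="Fail"):
    for grade, cut in thresholds:
        if score >= cut:
            return grade
    return default

scaled_scores = {"script_A": 0.0, "script_B": 38.1, "script_C": 64.3, "script_D": 100.0}
thresholds = build_thresholds(scaled_scores, {"Pass": "script_B", "First": "script_D"})
print({s: assign_grade(v, thresholds) for s, v in scaled_scores.items()})
# {'script_A': 'Fail', 'script_B': 'Pass', 'script_C': 'Pass', 'script_D': 'First'}
```

Nothing in that sketch is hidden: the panel can point at the boundary scripts, defend them, and revise them, which is exactly the auditable decision we are arguing for.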

4. Transparency via the Digital Audit Trail

Critics worry that without red ink on the paper, students can't see why they got their grade. We agree that transparency matters; we disagree that rubrics have a monopoly on it.

Beyond the Red Pen: Real transparency isn't a checkbox; it’s an exemplar. By showing students their work alongside scripts that ranked slightly higher (or slightly lower), we provide a "map" for improvement that a rubric-based number can never match. Students see what "better" actually looks like, not just which boxes they failed to tick.

The Appeals Process: In a traditional appeal, you get a second opinion. In RM Compare, we interrogate the process. We can see exactly how many times a script was seen, who saw it, how long decisions took, and how it sits in its local neighbourhood of similar work. This digital audit trail supports robust quality assurance and gives institutions more to work with in an appeal than a single annotated script ever could.
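
As an illustration of the kind of evidence this makes available, here is a hypothetical sketch of an audit query. The record fields and numbers are invented for this post; the point is that every decision is a queryable record rather than red ink.

```python
# A hypothetical shape for the judgment audit trail described above. Field names and
# values are invented for illustration; the point is that each decision is a record
# that can be interrogated during quality assurance or an appeal.

judgments = [
    {"winner": "script_C", "loser": "script_A", "judge": "judge_1", "seconds": 41},
    {"winner": "script_C", "loser": "script_B", "judge": "judge_2", "seconds": 65},
    {"winner": "script_D", "loser": "script_C", "judge": "judge_1", "seconds": 58},
]

def audit_summary(script, records):
    """How often was this script seen, by whom, and how long did those decisions take?"""
    seen = [r for r in records if script in (r["winner"], r["loser"])]
    return {
        "times_seen": len(seen),
        "judges": sorted({r["judge"] for r in seen}),
        "mean_decision_seconds": round(sum(r["seconds"] for r in seen) / len(seen), 1) if seen else None,
    }

print(audit_summary("script_C", judgments))
# {'times_seen': 3, 'judges': ['judge_1', 'judge_2'], 'mean_decision_seconds': 54.7}
```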

5. Adaptive Efficiency vs. Static Rounds

Many academic studies use fixed-round protocols (for example, every paper seen 12 times) for scientific control. This is a poor way to run a real-world assessment and is the primary driver of "workload" complaints.

The Solution: We use Adaptive Scaling. Our algorithm identifies which scripts are already "stable" on the map and stops showing them, instead focusing judge effort on the scripts that are harder to separate. In other words, we allocate cognitive effort where the mathematical tension is, not evenly across all scripts regardless of need.
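
For readers who want to see the principle rather than our production algorithm, here is a minimal sketch. The stability measure (a standard-error threshold), the pairing rule, and the numbers are illustrative assumptions about the general approach, not RM Compare's actual implementation.

```python
# A minimal sketch of the adaptive principle: spend the next comparison on scripts whose
# estimates are least stable, pairing them with a close neighbour on the scale. The
# standard-error threshold and pairing rule are illustrative assumptions only.

def next_pair(scripts, stable_se=0.3):
    """scripts: dict of id -> {'theta': estimated quality in logits, 'se': standard error}."""
    # Ignore scripts whose estimates are already stable enough.
    active = {k: v for k, v in scripts.items() if v["se"] > stable_se}
    if len(active) < 2:
        return None                                # nothing left that needs more judging
    # Anchor on the least stable script...
    anchor = max(active, key=lambda k: active[k]["se"])
    # ...and pair it with the closest remaining active script on the quality scale.
    partner = min(
        (k for k in active if k != anchor),
        key=lambda k: abs(active[k]["theta"] - active[anchor]["theta"]),
    )
    return anchor, partner

scripts = {
    "script_A": {"theta": -1.8, "se": 0.20},   # already stable: not shown again
    "script_B": {"theta": -0.2, "se": 0.60},
    "script_C": {"theta": 0.9, "se": 0.70},
    "script_D": {"theta": 2.4, "se": 0.25},    # already stable: not shown again
}
print(next_pair(scripts))   # ('script_C', 'script_B')
```

Whatever the exact rule, the effect is the same: judging effort flows to the comparisons that still carry information, which is where the workload savings come from.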

In a high-stakes setting, efficiency comes from this targeted use of expert time, not from insisting that every script must be seen an arbitrary number of times. The research protocols that generate high workload figures are often measuring a design that no practitioner should ever be forced to use.

6. Surfacing the "Common View"

Finally, there is the fear that without a rubric, judging becomes "subjective."

Social Moderation: A rubric assumes consensus; ACJ builds it. Through the process of comparison, a Common View of quality emerges from the community of experts. The consensus lives in the pattern of decisions, not in a document written in advance.

Misfit Detection: Unlike a rubric, which can be followed "correctly" by a poor marker, ACJ identifies misfit judges: those whose decisions are out of step with the collective professional standard. This can be seen in their statistical misfit and in the instability their decisions introduce. We aren't losing criteria; we are surfacing the actual criteria the expert community values and detecting when individuals deviate from them in ways that threaten fairness. In rubric-based systems, "rater drift" and idiosyncratic marking are often discovered only during moderation sampling, sometimes even after grades are issued; in ACJ, misfit can be surfaced during the process, when you can still act.
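
One way to make "statistical misfit" tangible is a residual check: compare each judge's decisions with the probability the fitted model assigns to them, and flag judges whose decisions are persistently surprising. The sketch below uses a simple mean squared residual; the exact statistic and flagging threshold RM Compare reports are not described here, so treat this as an illustration of the idea, not the product.

```python
# A sketch of residual-based judge misfit. Each decision is compared with the probability
# predicted from the final quality estimates; judges whose average squared residual is
# unusually high stand out. The statistic and the example data are illustrative only.
import math
from collections import defaultdict

def win_probability(theta_winner, theta_loser):
    """Bradley-Terry / Rasch-style probability that the recorded winner beats the loser."""
    return 1.0 / (1.0 + math.exp(-(theta_winner - theta_loser)))

def judge_misfit(judgments, thetas):
    residuals = defaultdict(list)
    for j in judgments:
        p = win_probability(thetas[j["winner"]], thetas[j["loser"]])
        residuals[j["judge"]].append((1.0 - p) ** 2)   # observed outcome is 1 for the winner
    return {judge: round(sum(r) / len(r), 3) for judge, r in residuals.items()}

thetas = {"script_A": -1.8, "script_B": -0.2, "script_C": 0.9, "script_D": 2.4}
judgments = [
    {"judge": "judge_1", "winner": "script_D", "loser": "script_A"},   # expected outcome
    {"judge": "judge_1", "winner": "script_C", "loser": "script_B"},   # expected outcome
    {"judge": "judge_2", "winner": "script_A", "loser": "script_D"},   # surprising outcome
]
print(judge_misfit(judgments, thetas))   # {'judge_1': 0.031, 'judge_2': 0.971}
```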

Conclusion: Designing for a Comparative World

The research study is a useful contribution. It shows that ACJ applied to rubric-designed tasks can produce the reliability we need, but at a cost in transparency and workload that institutions are right to question. The answer is not to abandon ACJ but to design for it from the start: holistic tasks, adaptive efficiency, exemplar-based feedback, and explicit standard-setting.

ACJ is not a poor imitation of traditional marking; it is a different way of knowing and communicating what quality looks like.