What the latest research tells us about high-stakes assessment and ACJ

The recent study in Frontiers in Education, "Beyond Reliability: Examining the Applicability of Adaptive Comparative Judgment (ACJ) in High-Stakes Assessment" (2026), confirms what we at RM Compare have long championed: ACJ (using RM Compare) produces exceptionally high reliability, even in the "messy" and complex world of higher education.

The authors have also done the field a service by naming the friction points practitioners actually experience: workload, the challenge of mapping ranks to grades, and anxieties about transparency and "surface-level" judging. Those descriptions are real; where we differ is in the conclusion that they represent flaws in ACJ itself. Instead, they are the inevitable friction that occurs when you run ACJ on items built for a different form of assessment altogether.

If you use ACJ to mark a task designed for a rubric, you aren't testing the technology; you are testing its ability to mimic an entirely different system. This is a clear category error which, unsurprisingly, produces the results and conclusions described in the research.

Using oranges to test for apples is not going to offer much clarity. To move forward, we need a different starting point across six key areas.

1. The "Rubric-Task" Trap

The most significant design issue in contemporary ACJ research is Construct Contamination. Most high-stakes tasks are currently designed for "point-harvesting." They are fragmented into small, discrete questions (1a, 1b, etc.) specifically so a marker can tick boxes on a rubric.

When you force experts to compare these fragmented scripts, you force them into mental accounting. They end up looking for specific fragments rather than evaluating the quality of the work. This is what leads to the "surface-level" bias noted in the study.

The Shift: ACJ is built for holistic, integrated tasks. If a response can be easily marked by a rubric, it probably shouldn't be assessed with ACJ. We should use this technology to assess synthesis, professional flair, and "the sense of the whole."

2. The Expertise Paradox: Speed ≠ Shallow

The study suggests that short judgment times lead to shallow evaluations. This rests on the assumption that slowness is a proxy for rigor. We challenge that.

Recognition vs. Calculation: In fields like emergency medicine or elite sports, we value expert intuition and the ability to recognise complex patterns instantly. A rubric forces a marker to act like a human calculator, adding up parts to find a sum. ACJ allows the judge to function as a professional, recognising quality through comparison rather than laboriously itemising it.

Fast decisions can still be deeply informed decisions when they are made by genuine experts working with well-designed, holistic tasks. The issue is not speed per se, but whether the task and judging context allow expertise to operate in recognition mode rather than checklist mode.

3. Grading as Policy, Not Mathematics

A major friction point cited was the difficulty of mapping rank orders to grades. This is what the researchers called "post-hoc thresholding." At RM Compare, we view this as a moment of technological honesty, not a design flaw.

The Linear Ruler: We have solved the mathematical challenge of converting "True Scores" expressed in logits (a log-odds scale) into a linear 0–100 scale. This turns a "rank" into a "ruler" that educators actually understand and can work with.
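
To make this concrete, here is a minimal sketch, in Python, of what a logit-to-scale conversion can look like. The rescaling choice (anchoring the lowest and highest scripts at 0 and 100) and the script names are illustrative assumptions for this post, not a description of RM Compare's internal method.

```python
# A minimal sketch of rescaling logit-based "True Scores" onto a 0-100 ruler.
# The min-max anchoring and the script names are illustrative assumptions,
# not RM Compare's actual conversion.

def logits_to_scale(logits, low=0.0, high=100.0):
    """Linearly map logit quality estimates onto a low-to-high ruler."""
    lo, hi = min(logits), max(logits)
    if hi == lo:                                   # all scripts identical: put everything mid-scale
        return [(low + high) / 2 for _ in logits]
    span = high - low
    return [round(low + span * (x - lo) / (hi - lo), 1) for x in logits]

# Four scripts with estimated quality parameters in logits.
true_scores = {"script_A": -1.8, "script_B": -0.2, "script_C": 0.9, "script_D": 2.4}
scaled = dict(zip(true_scores, logits_to_scale(list(true_scores.values()))))
print(scaled)   # {'script_A': 0.0, 'script_B': 38.1, 'script_C': 64.3, 'script_D': 100.0}
```

The result is a scale educators can read at a glance, while the underlying comparisons still do the measurement work.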

Standard Setting: Grade conversion is a Standard Setting exercise. By choosing boundary scripts that embody a "Pass" or a "First," humans make an intentional, defensible policy decision. Critics frame this as a disadvantage: one more step, one more meeting. We see it as grade validity.

A rubric creates the illusion that grades write themselves from accumulated points, but those thresholds (for example, 50% = pass, 70% = distinction) are also policy decisions; they are just hidden inside the rubric's point structure. ACJ makes the decision explicit and auditable: "These scripts represent a Pass. Do you agree?" That is not extra work; it is honest work. ACJ doesn't hide the grading process; it makes it a conscious, expert-led act.
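
To show what explicit standard setting can look like in practice, here is a small, hypothetical sketch: the panel nominates boundary scripts, and each boundary script's scaled score becomes the cut for that grade. The grade labels, script names, and cut points are illustrative assumptions, not RM Compare defaults.

```python
# A sketch of explicit standard setting: the panel nominates boundary scripts, and each
# boundary script's scaled score becomes the threshold for that grade. Labels, names,
# and numbers are illustrative assumptions for this post.

def build_thresholds(scaled_scores, boundary_scripts):
    """boundary_scripts maps grade label -> the script the panel agreed 'just earns' it."""
    return sorted(
        ((grade, scaled_scores[script]) for grade, script in boundary_scripts.items()),
        key=lambda pair: pair[1],
        reverse=True,                              # check the highest grade first
    )

def assign_grade(score, thresholds, default="Fail"):
    for grade, cut in thresholds:
        if score >= cut:
            return grade
    return default

scaled_scores = {"script_A": 0.0, "script_B": 38.1, "script_C": 64.3, "script_D": 100.0}
thresholds = build_thresholds(scaled_scores, {"Pass": "script_B", "First": "script_D"})
print({s: assign_grade(v, thresholds) for s, v in scaled_scores.items()})
# {'script_A': 'Fail', 'script_B': 'Pass', 'script_C': 'Pass', 'script_D': 'First'}
```

Nothing in that sketch is hidden: the panel can point at the boundary scripts, defend them, and revise them, which is exactly the auditable decision we are arguing for.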

4. Transparency via the Digital Audit Trail

Critics worry that without red ink on the paper, students can't see why they got their grade. We agree that transparency matters; we disagree that rubrics have a monopoly on it.

Beyond the Red Pen: Real transparency isn't a checkbox; it’s an exemplar. By showing students their work alongside scripts that ranked slightly higher (or slightly lower), we provide a "map" for improvement that a rubric-based number can never match. Students see what "better" actually looks like, not just which boxes they failed to tick.

The Appeals Process: In a traditional appeal, you get a second opinion. In RM Compare, we interrogate the process. We can see exactly how many times a script was seen, who saw it, how long decisions took, and how it sits in its local neighbourhood of similar work. This digital audit trail supports robust quality assurance and gives institutions more to work with in an appeal than a single annotated script ever could.
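
As an illustration of the kind of evidence this makes available, here is a hypothetical sketch of an audit query. The record fields and numbers are invented for this post; the point is that every decision is a queryable record rather than red ink.

```python
# A hypothetical shape for the judgment audit trail described above. Field names and
# values are invented for illustration; the point is that each decision is a record
# that can be interrogated during quality assurance or an appeal.

judgments = [
    {"winner": "script_C", "loser": "script_A", "judge": "judge_1", "seconds": 41},
    {"winner": "script_C", "loser": "script_B", "judge": "judge_2", "seconds": 65},
    {"winner": "script_D", "loser": "script_C", "judge": "judge_1", "seconds": 58},
]

def audit_summary(script, records):
    """How often was this script seen, by whom, and how long did those decisions take?"""
    seen = [r for r in records if script in (r["winner"], r["loser"])]
    return {
        "times_seen": len(seen),
        "judges": sorted({r["judge"] for r in seen}),
        "mean_decision_seconds": round(sum(r["seconds"] for r in seen) / len(seen), 1) if seen else None,
    }

print(audit_summary("script_C", judgments))
# {'times_seen': 3, 'judges': ['judge_1', 'judge_2'], 'mean_decision_seconds': 54.7}
```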

5. Adaptive Efficiency vs. Static Rounds

Many academic studies use fixed-round protocols (for example, every paper seen 12 times) for scientific control. This is a poor way to run a real-world assessment and is the primary driver of "workload" complaints.

The Solution: We use Adaptive Scaling. Our algorithm identifies which scripts are already "stable" on the map and stops showing them, instead focusing judge effort on the scripts that are harder to separate. In other words, we allocate cognitive effort where the mathematical tension is, not evenly across all scripts regardless of need.
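
For readers who want to see the principle rather than our production algorithm, here is a minimal sketch. The stability measure (a standard-error threshold), the pairing rule, and the numbers are illustrative assumptions about the general approach, not RM Compare's actual implementation.

```python
# A minimal sketch of the adaptive principle: spend the next comparison on scripts whose
# estimates are least stable, pairing them with a close neighbour on the scale. The
# standard-error threshold and pairing rule are illustrative assumptions only.

def next_pair(scripts, stable_se=0.3):
    """scripts: dict of id -> {'theta': estimated quality in logits, 'se': standard error}."""
    # Ignore scripts whose estimates are already stable enough.
    active = {k: v for k, v in scripts.items() if v["se"] > stable_se}
    if len(active) < 2:
        return None                                # nothing left that needs more judging
    # Anchor on the least stable script...
    anchor = max(active, key=lambda k: active[k]["se"])
    # ...and pair it with the closest remaining active script on the quality scale.
    partner = min(
        (k for k in active if k != anchor),
        key=lambda k: abs(active[k]["theta"] - active[anchor]["theta"]),
    )
    return anchor, partner

scripts = {
    "script_A": {"theta": -1.8, "se": 0.20},   # already stable: not shown again
    "script_B": {"theta": -0.2, "se": 0.60},
    "script_C": {"theta": 0.9, "se": 0.70},
    "script_D": {"theta": 2.4, "se": 0.25},    # already stable: not shown again
}
print(next_pair(scripts))   # ('script_C', 'script_B')
```

Whatever the exact rule, the effect is the same: judging effort flows to the comparisons that still carry information, which is where the workload savings come from.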

In a high-stakes setting, efficiency comes from this targeted use of expert time, not from insisting that every script must be seen an arbitrary number of times. The research protocols that generate high workload figures are often measuring a design that no practitioner should ever be forced to use.

6. Surfacing the "Common View"

Finally, there is the fear that without a rubric, judging becomes "subjective."

Social Moderation: A rubric assumes consensus; ACJ builds it. Through the process of comparison, a Common View of quality emerges from the community of experts. The consensus lives in the pattern of decisions, not in a document written in advance.

Misfit Detection: Unlike a rubric, which can be followed "correctly" by a poor marker, ACJ identifies misfit judges: those whose decisions are out of step with the collective professional standard. This can be seen in their statistical misfit and in the instability their decisions introduce. We aren't losing criteria; we are surfacing the actual criteria the expert community values and detecting when individuals deviate from them in ways that threaten fairness. In rubric-based systems, "rater drift" and idiosyncratic marking are often discovered only during moderation sampling, sometimes even after grades are issued; in ACJ, misfit can be surfaced during the process, when you can still act.
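
One way to make "statistical misfit" tangible is a residual check: compare each judge's decisions with the probability the fitted model assigns to them, and flag judges whose decisions are persistently surprising. The sketch below uses a simple mean squared residual; the exact statistic and flagging threshold RM Compare reports are not described here, so treat this as an illustration of the idea, not the product.

```python
# A sketch of residual-based judge misfit. Each decision is compared with the probability
# predicted from the final quality estimates; judges whose average squared residual is
# unusually high stand out. The statistic and the example data are illustrative only.
import math
from collections import defaultdict

def win_probability(theta_winner, theta_loser):
    """Bradley-Terry / Rasch-style probability that the recorded winner beats the loser."""
    return 1.0 / (1.0 + math.exp(-(theta_winner - theta_loser)))

def judge_misfit(judgments, thetas):
    residuals = defaultdict(list)
    for j in judgments:
        p = win_probability(thetas[j["winner"]], thetas[j["loser"]])
        residuals[j["judge"]].append((1.0 - p) ** 2)   # observed outcome is 1 for the winner
    return {judge: round(sum(r) / len(r), 3) for judge, r in residuals.items()}

thetas = {"script_A": -1.8, "script_B": -0.2, "script_C": 0.9, "script_D": 2.4}
judgments = [
    {"judge": "judge_1", "winner": "script_D", "loser": "script_A"},   # expected outcome
    {"judge": "judge_1", "winner": "script_C", "loser": "script_B"},   # expected outcome
    {"judge": "judge_2", "winner": "script_A", "loser": "script_D"},   # surprising outcome
]
print(judge_misfit(judgments, thetas))   # {'judge_1': 0.031, 'judge_2': 0.971}
```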

Conclusion: Designing for a Comparative World

The research study is a useful contribution. It shows that ACJ applied to rubric-designed tasks can produce the reliability we need, but at a cost in transparency and workload that institutions are right to question. The answer is not to abandon ACJ but to design for it from the start: holistic tasks, adaptive efficiency, exemplar-based feedback, and explicit standard-setting.

ACJ is not a poor imitation of traditional marking; it is a different way of knowing and communicating what quality looks like.