How to enjoy fries on the beach, undisturbed by seagulls: surprising truths from a recent ACJ study
A recent study (2025) by Jeffrey Buckley (Technological University of the Shannon) and Caiwei Zhu (Delft University of Technology) set out to answer a critical question for anyone seeking fairness and efficiency in educational assessment: how feasible is Adaptive Comparative Judgement (ACJ) when deployed in real classrooms?
To find out, they recruited 20 industrial designers to evaluate over 200 anonymized design portfolios created by primary and secondary school pupils tackling a real-world problem - how to enjoy fries on the beach, undisturbed by seagulls! Using an online ACJ platform, judges compared pairs of student work, making binary holistic judgements while their decision times and patterns were carefully recorded. The study aimed to uncover whether difficult comparisons took longer, whether judge fatigue set in across assessment rounds, and what these findings could mean for scaling ACJ - especially for platforms like RM Compare.
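The paper doesn't publish the platform's code, but the data being captured is easy to picture. Here is a minimal, hypothetical sketch (the `Judgement` record and `run_round` helper are illustrative names, not from the study or from RM Compare) of how a round of pairwise judging might log the binary choice and the decision time for each pair:

```python
import time
from dataclasses import dataclass

@dataclass
class Judgement:
    left_id: str     # portfolio shown on the left
    right_id: str    # portfolio shown on the right
    winner_id: str   # the judge's binary holistic choice
    seconds: float   # decision time, the signal the study analysed

def run_round(judge, pairs):
    """Collect one round of pairwise judgements, timing each decision.

    `judge` is any callable taking (left_id, right_id) and returning the
    id of the preferred portfolio, standing in for the human assessor.
    """
    log = []
    for left, right in pairs:
        start = time.monotonic()
        winner = judge(left, right)
        log.append(Judgement(left, right, winner, time.monotonic() - start))
    return log
```

Timestamped records like these are what let the researchers ask whether harder pairs took longer and whether judges slowed down late in a session.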
The results? Six key lessons - and each has a clear implication for anyone using RM Compare to deliver reliable, scalable, and transparent assessment.
Six key lessons
1. Difficult Judgements Don’t Take Longer
Learning: Judges spent the same amount of time on “easy” and “difficult” pairings, whether student work was similar in quality or obviously different.
Implication: RM Compare sessions can be scheduled with predictable timing, without bloated buffers for complex comparisons. Reliable planning gets easier.
2. Judge Fatigue Rarely Limits Assessment Quality
Learning: Across dozens of consecutive judgements, most assessors were consistent in speed and reliability. Only a handful displayed minor pacing shifts or signs of fatigue, and even then the trends varied by individual.
Implication: RM Compare is robust for classroom, department, or school-wide rollout. Fatigue is not a fundamental blocker for scalability.
3. Session Timing Is Predictable and Manageable
Learning: With judges averaging around 50 seconds per judgement, ACJ sessions are even faster than previous studies suggested, making the process manageable for busy educators.
Implication: RM Compare lets teachers and leaders plan sessions and reporting workflows with clear, evidence-based expectations, reducing guesswork.
4. Algorithmic Efficiency Sets the Ceiling for Scale
Learning: The true limit for scaling ACJ isn't slow human decision-making; it's the number of pairings needed to reach robust rankings (see the quick arithmetic sketch after the six lessons).
Implication: RM Compare’s ongoing technical work focuses on refining its pairing and ranking algorithms, delivering faster results and lowering costs for users.
5. Intuition Drives Effective Judgement
Learning: Assessors relied on fast, instinctive choices rather than laborious analysis, even for challenging calls. Professional heuristics powered reliable outcomes.
Implication: RM Compare continues to design for speed, prioritising intuitive prompts and interfaces that support teachers’ expertise, rather than demanding exhaustive evidence for every decision.
6. Transparency Earns Trust (and Adoption)
Learning: Many teachers and stakeholders prefer traditional rubrics, making ACJ’s less familiar processes a barrier unless the system is clearly explained.
Implication: RM Compare will keep investing in onboarding, help guides, and transparent communication - making algorithms and results understandable for both experts and newcomers. We are also working hard toward a refreshed user experience (news coming soon!).
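To see why the pairing budget, rather than judging speed, sets the ceiling (lesson 4), it helps to run the study's own numbers: roughly 200 portfolios and about 50 seconds per judgement. The sketch below compares an exhaustive round-robin with an adaptive scheme; the figure of ~15 judgements per item is an illustrative assumption drawn from the wider ACJ literature, not a result from this paper:

```python
n_items = 200            # portfolios in the study
secs_per_judgement = 50  # average decision time reported

# Exhaustive round-robin: every portfolio meets every other one.
round_robin = n_items * (n_items - 1) // 2   # 19,900 pairings

# Adaptive pairing: assume ~15 judgements per item (illustrative
# assumption; each pairing counts towards two items' totals).
judgements_per_item = 15
adaptive = n_items * judgements_per_item // 2  # 1,500 pairings

for label, pairings in [("round-robin", round_robin), ("adaptive", adaptive)]:
    hours = pairings * secs_per_judgement / 3600
    print(f"{label}: {pairings:,} pairings, about {hours:.0f} judge-hours")
```

Under these assumptions the adaptive scheme needs roughly 21 judge-hours against nearly 280 for a round-robin, which is why refining the pairing algorithm, not speeding up judges, is where the scaling gains live.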
Bottom line
Buckley and Zhu’s research shows that ACJ is not only feasible and efficient but also ready for the real demands of today’s assessment environments when paired with smart technology and strong user support. RM Compare is built to put these lessons into practice, helping educators deliver fast, fair, and transparent judgement at every scale.
Finally, while these results are promising, the report notes that "they should be interpreted cautiously. The findings are based on a specific task and sample, and replication in additional studies with varied design contexts, age groups, and judgement volumes is necessary to confirm their generalisability."