Tackling Reliability in Adaptive Comparative Judgement: What RM Compare Users Need to Know

If you’ve been following the evolution of digital assessment, you’ll know that Adaptive Comparative Judgement (ACJ) is transforming how we judge quality—especially with platforms like RM Compare. But you might also have heard about concerns over “inflated reliability statistics.” Is this something to worry about? Let’s look at what the research says, and why RM Compare users can be reassured.
Where Did the Concern Come From?
The question of reliability in ACJ was brought into sharp focus by Tom Bramley’s influential research. In his 2015 Cambridge Assessment report, Bramley demonstrated through simulations that the adaptive algorithms used in ACJ could artificially inflate the Scale Separation Reliability (SSR) statistic—even when the underlying data was random. This meant that, in some scenarios, the reliability numbers could look much better than they truly were, especially if the adaptive process started too early or with too few comparisons per item.
Bramley’s work was a crucial wake-up call for the field, highlighting that while adaptivity made ACJ efficient, it could also introduce “spurious separation” among scripts, making SSR alone an unreliable indicator of true reliability.
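For readers who like to see the mechanics, the sketch below shows the standard separation-reliability calculation that SSR is based on: the observed spread of the script estimates compared with the measurement error attached to them. This is an illustrative Python sketch, not RM Compare's internal code, and the function name is ours; the point is simply that anything which stretches the estimates apart faster than their standard errors shrink (which is what spurious separation does) pushes SSR towards 1.

```python
import numpy as np

def scale_separation_reliability(estimates, standard_errors):
    """Illustrative Scale Separation Reliability (SSR) calculation.

    estimates: parameter estimates (e.g. logit values), one per script
    standard_errors: the standard error attached to each estimate

    SSR = (observed variance - mean error variance) / observed variance,
    i.e. the proportion of the observed spread that is not measurement error.
    """
    observed_variance = np.var(estimates, ddof=1)          # spread of the script estimates
    error_variance = np.mean(np.square(standard_errors))   # average squared standard error
    true_variance = max(observed_variance - error_variance, 0.0)
    return true_variance / observed_variance

# If adaptivity spuriously stretches the estimates apart, observed_variance grows
# relative to error_variance and SSR rises -- even when the underlying data are random.
```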
How Did the Field Respond?
Professor Richard Kimbell, a leading figure in ACJ research and development, took these findings seriously. In his 2022 paper, Kimbell openly acknowledged the issue, describing how the problem was identified and then addressed in collaboration with software developers—including those behind RM Compare. The adaptive algorithm was refined to mitigate the risk of inflated reliability, and new guidance was put in place to ensure that SSR is interpreted in context, not in isolation.
Kimbell’s perspective is pragmatic and reassuring: innovation in assessment is a journey, and the willingness to identify and fix problems is a hallmark of a robust, transparent system.
What Does the Latest Research Show?
The most recent and comprehensive evidence comes from Wang & Zheng (2025), who used RM Compare to assess spoken language proficiency. Their study went beyond SSR, validating reliability with split-half methods and cross-checking results using established tools like FACETS. The findings were clear:
- High SSR values (≥ 0.90) were matched by strong split-half reliability and robust agreement with traditional scoring methods.
- ACJ rankings using RM Compare closely tracked expert judgements and rubric-based scores.
- The study confirmed that the platform’s reliability is not an artifact of the algorithm, but reflects genuine consensus among judges.
Wang & Zheng also highlighted that RM Compare’s ACJ implementation addresses the reliability inflation concerns raised by Bramley, making it suitable even for high-stakes assessment
What Does This Mean for RM Compare Users?
- You can trust the results. The algorithms in RM Compare have been refined and validated by independent research, including the latest work by Wang & Zheng (2025).
- Reliability is multi-faceted. RM Compare supports a range of reliability and validity checks, not just SSR, giving you a fuller picture of assessment quality.
- Continuous improvement. The RM Compare team remains committed to staying at the forefront of research and innovation, ensuring the platform evolves with the evidence.
Looking Ahead
Bramley’s early critique was vital in making ACJ—and RM Compare—even stronger. Thanks to ongoing research and development, users can now be confident that RM Compare delivers results that are both efficient and genuinely reliable.
So, whether you’re running a classroom project or a large-scale assessment, you can be assured that RM Compare stands on a foundation of robust, transparent, and continually improving science.
Stay tuned to the RM Compare blog for more insights and updates as we continue to lead the way in digital assessment.
References
- Bramley, T. (2015). Investigating the reliability of Adaptive Comparative Judgment. Cambridge Assessment Research Report.
- Kimbell, R. (2022). Examining the reliability of Adaptive Comparative Judgement (ACJ) as an assessment tool in educational settings. International Journal of Technology and Design Education, 32(3), 1515-1529.
- Wang, Z., & Zheng, Y. (2025). Assessing intelligibility as conceptualised within the CEFR-companion volume (CV) framework using Adaptive Comparative Judgement.