New Research Confirms: RM Compare Delivers Reliable and Fair Spoken Language Assessment

The landscape of language assessment is evolving rapidly, especially with the 2020 update to the Common European Framework of Reference for Languages (CEFR-CV). This update redefines intelligibility—now encompassing both actual understanding and the perceived ease of understanding—making it a more holistic and communicative measure. But how do we assess this complex construct reliably and at scale? Enter Adaptive Comparative Judgement (ACJ) and, specifically, RM Compare.

A new study by Wang and Zheng (2025) puts RM Compare at the centre of this conversation, rigorously examining its potential to deliver on the CEFR-CV’s ambitious goals for pronunciation and speaking assessment.

Wang & Zheng Research Paper (2025)

Assessing intelligibility as conceptualised within the CEFR-companion volume (CV) framework using Adaptive Comparative Judgement

The Challenge: Measuring a Broader, Real-World Intelligibility

Traditionally, intelligibility and comprehensibility were treated as separate constructs in second language (L2) pronunciation research. The CEFR-CV merges these, defining intelligibility as both the listener's actual understanding and their perceived difficulty in understanding. This shift aligns assessment with real-world communication, but it also exposes the limitations of existing methods, such as transcription tasks or Likert scales, which are either too narrow or conflate the two constructs.

The Solution: Adaptive Comparative Judgement with RM Compare

Wang & Zheng’s study piloted ACJ using RM Compare, with judges comparing pairs of audio samples from Mandarin-speaking learners of English. Judges decided which sample in each pair was more intelligible, using the new CEFR-CV definition as their guide. The evaluation drew on both quantitative data (statistical reliability and validity checks) and qualitative data (think-aloud protocols and interviews).
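For readers curious about what happens behind the scenes, the sketch below shows, in simplified form, how a set of pairwise judgements can be turned into a ranked scale using a Bradley-Terry style model. This is an illustrative approximation only, not RM Compare's actual algorithm; the sample data and function names are invented.

```python
# A minimal sketch (assumption: Bradley-Terry / Rasch-style estimation) of how
# pairwise "which is more intelligible?" decisions become a ranked scale.
# This is NOT RM Compare's implementation.
import numpy as np

def fit_acj_scale(judgements, n_items, n_iters=2000, lr=0.05):
    """judgements: list of (winner_index, loser_index) pairs."""
    theta = np.zeros(n_items)          # one scale value per audio sample
    for _ in range(n_iters):
        grad = np.zeros(n_items)
        for winner, loser in judgements:
            # modelled probability that the winner is judged more intelligible
            p = 1.0 / (1.0 + np.exp(-(theta[winner] - theta[loser])))
            grad[winner] += 1.0 - p    # push the winner up the scale
            grad[loser]  -= 1.0 - p    # push the loser down the scale
        theta += lr * grad
        theta -= theta.mean()          # anchor the scale so values sum to zero
    return theta

# Hypothetical example: 4 audio samples, 6 pairwise judgements
judgements = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (3, 2)]
theta = fit_acj_scale(judgements, n_items=4)
print("Estimated scale values:", np.round(theta, 2))
print("Rank order (most to least intelligible):", np.argsort(-theta))
```

The key point is that judges never assign scores directly: the scale emerges from the pattern of wins and losses across many comparisons, which is what makes the approach quick for judges yet statistically robust.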

Key Findings

Reliability

ACJ via RM Compare produced highly reliable results (Scale Separation Reliability, SSR ≥ 0.90), consistent across both experienced and novice judges. Results were robust when cross-validated with established tools such as FACETS.
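SSR is the standard reliability index in the ACJ and Rasch literature: the proportion of observed variance in the scale values that is not attributable to measurement error. The snippet below shows how it is typically calculated; the scale values and standard errors are invented for illustration and are not data from the study.

```python
# A hedged sketch of the usual Scale Separation Reliability calculation
# (true variance divided by observed variance). Figures are illustrative only.
import numpy as np

def scale_separation_reliability(theta, standard_errors):
    observed_var = np.var(theta, ddof=1)              # variance of scale values
    error_var = np.mean(np.square(standard_errors))   # mean squared standard error
    true_var = observed_var - error_var               # error-adjusted variance
    return true_var / observed_var

theta = np.array([-1.8, -0.6, 0.2, 0.9, 1.3])   # illustrative scale values
se = np.array([0.25, 0.22, 0.20, 0.23, 0.27])    # illustrative standard errors
print(f"SSR = {scale_separation_reliability(theta, se):.2f}")
```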

Validity

ACJ rankings correlated strongly with traditional rubric-based scores (ρ = 0.82–0.90), and judges’ decision criteria aligned with linguistic features known to affect intelligibility and comprehensibility.
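This kind of cross-validation amounts to a rank correlation between the ACJ scale and rubric scores. As an illustration of that check (with invented numbers, not the study's data), one might compute:

```python
# Illustrative rank-correlation check between ACJ scale values and rubric scores.
# Numbers are made up; they do not come from Wang & Zheng (2025).
from scipy.stats import spearmanr

acj_scale = [-1.8, -0.6, 0.2, 0.9, 1.3]     # illustrative ACJ scale values
rubric_score = [2, 3, 3, 4, 5]               # illustrative rubric band scores

rho, p_value = spearmanr(acj_scale, rubric_score)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```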

User Experience

Judges found ACJ reliable, comprehensive, and flexible.

What This Means for RM Compare

Validation for CEFR-CV Assessment

The study provides strong evidence that RM Compare is a reliable and valid platform for assessing the new, holistic intelligibility construct in the CEFR-CV. This is a significant endorsement for its use in both research and operational assessment contexts.

Unique Suitability for Speaking Assessment

RM Compare’s support for audio and video uploads makes it uniquely suited to pronunciation and speaking assessment, areas that most ACJ research has so far neglected in favour of written outputs.

Efficiency and Flexibility

ACJ allows judges to make holistic, comparative decisions quickly and with minimal training, reducing workload while maintaining high reliability. This is a major advantage for large-scale or time-constrained assessments.

Addressing Reliability Concerns

The study confirms that RM Compare’s ACJ implementation addresses concerns about inflated reliability, making it suitable even for high-stakes contexts.

Looking Ahead: Opportunities for Innovation

Wang & Zheng’s research highlights several areas for further development:

  • Nuanced Response Options: Allowing judges to indicate “similar proficiency” could improve fairness and analytic insight.
  • Broader Task Types and Judge Demographics: Expanding to more diverse speaking tasks and judge backgrounds will enhance authenticity and generalisability.
  • Analytic and Reporting Features: Enhanced cross-validation and reporting tools will further strengthen this use case.

Conclusion

The Wang & Zheng (2025) study is a milestone for RM Compare, demonstrating its capability to deliver valid, reliable, and scalable assessment of the CEFR-CV’s new intelligibility construct. As language assessment moves towards more communicative, real-world measures, RM Compare is well-positioned to lead the way—offering efficiency, flexibility, and robust psychometric foundations for the next generation of digital language assessment.

Stay tuned to the RM Compare blog for more insights on innovation in e-assessment and updates on our journey at the forefront of educational technology.