New Research Confirms: RM Compare Delivers Reliable and Fair Spoken Language Assessment

The landscape of language assessment is evolving rapidly, especially since the 2020 publication of the CEFR Companion Volume (CEFR-CV), the update to the Common European Framework of Reference for Languages. The Companion Volume redefines intelligibility to encompass both the listener's actual understanding and the perceived ease of understanding, making it a more holistic and communicative measure. But how do we assess this complex construct reliably and at scale? Enter Adaptive Comparative Judgement (ACJ) and, specifically, RM Compare.
A new study by Wang and Zheng (2025) puts RM Compare at the centre of this conversation, rigorously examining its potential to deliver on the CEFR-CV’s ambitious goals for pronunciation and speaking assessment.
Wang & Zheng research paper (2025):
Assessing intelligibility as conceptualised within the CEFR-companion volume (CV) framework using Adaptive Comparative Judgement
The Challenge: Measuring a Broader, Real-World Intelligibility
Traditionally, intelligibility and comprehensibility were treated as separate constructs in second language (L2) pronunciation research. The CEFR-CV merges these, defining intelligibility as both the listener's actual understanding and their perceived difficulty in understanding. This shift aligns assessment with real-world communication, but it also exposes the limitations of existing methods, such as transcription tasks or Likert scales, which are either too narrow or conflate the two constructs.
The Solution: Adaptive Comparative Judgement with RM Compare
Wang & Zheng’s study piloted ACJ using RM Compare, where judges compared pairs of audio samples from Mandarin-speaking learners of English. Judges decided which sample was more intelligible, using the new CEFR-CV definition as their guide. The process was supported by both quantitative (statistical reliability/validity) and qualitative (think-aloud protocols, interviews) data.
Key Findings
Reliability
ACJ via RM Compare produced highly reliable results (Scale Separation Reliability, SSR, of 0.90 or above), consistent across both experienced and novice judges. Results were also robust when cross-validated against established tools such as FACETS.
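For context, SSR is the ACJ analogue of Rasch separation reliability. As typically defined in the general ACJ literature (not specific to this study's computations), it compares the spread of the estimated sample parameters with their measurement error:

```latex
% SD_obs : standard deviation of the estimated sample parameters
% RMSE   : root mean square of the parameters' standard errors
% G      : separation index (true spread relative to measurement error)
G = \frac{\sqrt{SD_{\text{obs}}^{2} - RMSE^{2}}}{RMSE}, \qquad
SSR = \frac{G^{2}}{1 + G^{2}} = \frac{SD_{\text{obs}}^{2} - RMSE^{2}}{SD_{\text{obs}}^{2}}
```

On this reading, an SSR of 0.90 means roughly 90% of the variance in the final scale reflects genuine differences between samples rather than measurement noise.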
Validity
ACJ rankings correlated strongly with traditional rubric-based scores (Spearman's ρ = 0.82–0.90), and judges' decision criteria aligned with the linguistic features known to affect intelligibility and comprehensibility.
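A correlation of this kind is straightforward to reproduce. The snippet below computes Spearman's ρ between a set of ACJ scale values and rubric marks for the same speakers; the numbers are invented purely for illustration and do not come from the study.

```python
# Rank-order agreement between ACJ scale values and rubric scores (made-up data).
from scipy.stats import spearmanr

acj_scale_values = [2.1, 1.4, 0.9, 0.2, -0.5, -1.3, -2.0]  # hypothetical ACJ estimates
rubric_scores = [5.0, 4.5, 4.5, 3.5, 3.0, 2.5, 1.5]        # hypothetical rubric marks

rho, p_value = spearmanr(acj_scale_values, rubric_scores)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```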
User Experience
Judges found ACJ reliable, comprehensive, and flexible.
What This Means for RM Compare
Validation for CEFR-CV Assessment
The study provides strong evidence that RM Compare is a reliable and valid platform for assessing the new, holistic intelligibility construct in the CEFR-CV. This is a significant endorsement for its use in both research and operational assessment contexts.
Unique Suitability for Speaking Assessment
RM Compare's support for audio and video uploads makes it uniquely suited to pronunciation and speaking assessment, an area where most ACJ research to date has focused on written outputs.
Efficiency and Flexibility
ACJ allows judges to make holistic, comparative decisions quickly and with minimal training, reducing workload while maintaining high reliability. This is a major advantage for large-scale or time-constrained assessments.
Addressing Reliability Concerns
The study confirms that RM Compare’s ACJ implementation addresses concerns about inflated reliability, making it suitable even for high-stakes contexts.
Looking Ahead: Opportunities for Innovation
Wang & Zheng’s research highlights several areas for further development:
- Nuanced Response Options: Allowing judges to indicate “similar proficiency” could improve fairness and analytic insight.
- Broader Task Types and Judge Demographics: Expanding to more diverse speaking tasks and judge backgrounds will enhance authenticity and generalisability.
- Analytic and Reporting Features: Enhanced cross-validation and reporting tools will further strengthen this use case.
Conclusion
The Wang & Zheng (2025) study is a milestone for RM Compare, demonstrating its capability to deliver valid, reliable, and scalable assessment of the CEFR-CV’s new intelligibility construct. As language assessment moves towards more communicative, real-world measures, RM Compare is well-positioned to lead the way—offering efficiency, flexibility, and robust psychometric foundations for the next generation of digital language assessment.
Stay tuned to the RM Compare blog for more insights on innovation in e-assessment and updates on our journey at the forefront of educational technology.