New Research Confirms: RM Compare Delivers Reliable and Fair Spoken Language Assessment

The landscape of language assessment is evolving rapidly, especially with the 2020 update to the Common European Framework of Reference for Languages (CEFR-CV). This update redefines intelligibility—now encompassing both actual understanding and the perceived ease of understanding—making it a more holistic and communicative measure. But how do we assess this complex construct reliably and at scale? Enter Adaptive Comparative Judgement (ACJ) and, specifically, RM Compare.

A new study by Wang and Zheng (2025) puts RM Compare at the centre of this conversation, rigorously examining its potential to deliver on the CEFR-CV’s ambitious goals for pronunciation and speaking assessment.

Wang & Zheng Research Paper (2025)

Assessing intelligibility as conceptualised within the CEFR-companion volume (CV) framework using Adaptive Comparative Judgement

The Challenge: Measuring a Broader, Real-World Intelligibility

Traditionally, intelligibility and comprehensibility were treated as separate constructs in second language (L2) pronunciation research. The CEFR-CV merges these, defining intelligibility as both the listener's actual understanding and their perceived difficulty in understanding. This shift aligns assessment with real-world communication, but it also exposes the limitations of existing methods, such as transcription tasks or Likert scales, which are either too narrow or conflate the two constructs.

The Solution: Adaptive Comparative Judgement with RM Compare

Wang & Zheng’s study piloted ACJ using RM Compare, with judges comparing pairs of audio samples from Mandarin-speaking learners of English. Judges decided which sample in each pair was more intelligible, using the new CEFR-CV definition as their guide. The evaluation drew on both quantitative data (statistical reliability and validity checks) and qualitative data (think-aloud protocols and interviews).
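For readers curious about what happens behind the scenes, the sketch below shows, in simplified form, how a set of pairwise judgements can be turned into a ranked scale using a Bradley-Terry style model. This is an illustrative approximation only, not RM Compare's actual algorithm; the sample data and function names are invented.

```python
# A minimal sketch (assumption: Bradley-Terry / Rasch-style estimation) of how
# pairwise "which is more intelligible?" decisions become a ranked scale.
# This is NOT RM Compare's implementation.
import numpy as np

def fit_acj_scale(judgements, n_items, n_iters=2000, lr=0.05):
    """judgements: list of (winner_index, loser_index) pairs."""
    theta = np.zeros(n_items)          # one scale value per audio sample
    for _ in range(n_iters):
        grad = np.zeros(n_items)
        for winner, loser in judgements:
            # modelled probability that the winner is judged more intelligible
            p = 1.0 / (1.0 + np.exp(-(theta[winner] - theta[loser])))
            grad[winner] += 1.0 - p    # push the winner up the scale
            grad[loser]  -= 1.0 - p    # push the loser down the scale
        theta += lr * grad
        theta -= theta.mean()          # anchor the scale so values sum to zero
    return theta

# Hypothetical example: 4 audio samples, 6 pairwise judgements
judgements = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (3, 2)]
theta = fit_acj_scale(judgements, n_items=4)
print("Estimated scale values:", np.round(theta, 2))
print("Rank order (most to least intelligible):", np.argsort(-theta))
```

The key point is that judges never assign scores directly: the scale emerges from the pattern of wins and losses across many comparisons, which is what makes the approach quick for judges yet statistically robust.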

Key Findings

Reliability

ACJ via RM Compare produced highly reliable results (Scale Separation Reliability, SSR ≥ 0.90), consistent across both experienced and novice judges. Results were robust when cross-validated with established tools such as FACETS.
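SSR is the standard reliability index in the ACJ and Rasch literature: the proportion of observed variance in the scale values that is not attributable to measurement error. The snippet below shows how it is typically calculated; the scale values and standard errors are invented for illustration and are not data from the study.

```python
# A hedged sketch of the usual Scale Separation Reliability calculation
# (true variance divided by observed variance). Figures are illustrative only.
import numpy as np

def scale_separation_reliability(theta, standard_errors):
    observed_var = np.var(theta, ddof=1)              # variance of scale values
    error_var = np.mean(np.square(standard_errors))   # mean squared standard error
    true_var = observed_var - error_var               # error-adjusted variance
    return true_var / observed_var

theta = np.array([-1.8, -0.6, 0.2, 0.9, 1.3])   # illustrative scale values
se = np.array([0.25, 0.22, 0.20, 0.23, 0.27])    # illustrative standard errors
print(f"SSR = {scale_separation_reliability(theta, se):.2f}")
```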

Validity

ACJ rankings correlated strongly with traditional rubric-based scores (ρ = 0.82–0.90), and judges’ decision criteria aligned with linguistic features known to affect intelligibility and comprehensibility.
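This kind of cross-validation amounts to a rank correlation between the ACJ scale and rubric scores. As an illustration of that check (with invented numbers, not the study's data), one might compute:

```python
# Illustrative rank-correlation check between ACJ scale values and rubric scores.
# Numbers are made up; they do not come from Wang & Zheng (2025).
from scipy.stats import spearmanr

acj_scale = [-1.8, -0.6, 0.2, 0.9, 1.3]     # illustrative ACJ scale values
rubric_score = [2, 3, 3, 4, 5]               # illustrative rubric band scores

rho, p_value = spearmanr(acj_scale, rubric_score)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```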

User Experience

Judges found ACJ reliable, comprehensive, and flexible.

What This Means for RM Compare

Validation for CEFR-CV Assessment

The study provides strong evidence that RM Compare is a reliable and valid platform for assessing the new, holistic intelligibility construct in the CEFR-CV. This is a significant endorsement for its use in both research and operational assessment contexts.

Unique Suitability for Speaking Assessment

RM Compare’s support for audio and video uploads makes it uniquely suited to pronunciation and speaking assessment, areas that most ACJ research has so far neglected in favour of written outputs.

Efficiency and Flexibility

ACJ allows judges to make holistic, comparative decisions quickly and with minimal training, reducing workload while maintaining high reliability. This is a major advantage for large-scale or time-constrained assessments.

Addressing Reliability Concerns

The study confirms that RM Compare’s ACJ implementation addresses concerns about inflated reliability, making it suitable even for high-stakes contexts.

Looking Ahead: Opportunities for Innovation

Wang & Zheng’s research highlights several areas for further development:

  • Nuanced Response Options: Allowing judges to indicate “similar proficiency” could improve fairness and analytic insight.
  • Broader Task Types and Judge Demographics: Expanding to more diverse speaking tasks and judge backgrounds will enhance authenticity and generalisability.
  • Analytic and Reporting Features: Enhanced cross-validation and reporting tools will further strengthen this use case.

Conclusion

The Wang & Zheng (2025) study is a milestone for RM Compare, demonstrating its capability to deliver valid, reliable, and scalable assessment of the CEFR-CV’s new intelligibility construct. As language assessment moves towards more communicative, real-world measures, RM Compare is well-positioned to lead the way—offering efficiency, flexibility, and robust psychometric foundations for the next generation of digital language assessment.

Stay tuned to the RM Compare blog for more insights on innovation in e-assessment and updates on our journey at the forefront of educational technology.