- Opinion
Playing Nicely with Rubrics: How RM Compare Enhances Absolute Assessment in the Age of AI
In a landscape where generative AI has eroded confidence in traditional written assessment signals, education leaders face an uncomfortable truth: the rubrics they've carefully crafted may no longer be fit for purpose on their own. We've written extensively on this topic before, including research from Princeton and Dartmouth showing how generative AI has damaged the signalling value of written applications, and why this demands fundamentally new assessment approaches. Yet abandoning rubrics entirely isn't the answer. Instead, institutions are discovering that the most powerful approach combines the best of both worlds: the intentional clarity of rubrics with the empirical insight of comparative judgment. RM Compare sits at this intersection, refining how rubrics work in practice and revealing the gap between what educators intend and what they actually assess.
The Assessment Crisis and the Curriculum Response
When students can generate high-quality prose at the click of a button, traditional written performance tasks become unreliable proxies for thinking, synthesis, and authentic capability.
Education systems are responding. New curricula increasingly emphasise applied thinking, problem-solving, and creative performance - skills that seem harder for AI to fake convincingly. Applied projects, design work, structured problem-solving, and multimodal tasks are replacing straightforward essay assessments. But here's the tension: as assessment tasks become more complex and less formulaic, rubrics - the traditional tool for making assessment transparent and consistent - begin to strain under the weight of what they're being asked to do.
A rubric assumes that quality can be deconstructed into discrete criteria, each with clear performance descriptors. In theory, markers read a submission, tick the boxes that fit, and arrive at a reliable score. In practice, assessing complex, creative, or novel work is far messier. Markers interpret the same descriptor differently. They struggle with tasks that don't fit neatly into rubric categories. They drift over time: what counted as "excellent" on the tenth essay may look different by the fiftieth. And when every submission is potentially AI-assisted or AI-augmented, the subtle signals that reveal authentic thinking are harder for a checklist to capture.
The result is that many institutions are running two parallel assessment systems: rubrics for transparency and feedback, and something less formal (sometimes just professional intuition) for the high-stakes judgement calls. This dualism is costly, opaque, and often unfair.
Rubrics Define Intention; Comparative Judgment Reveals Reality
This is where RM Compare introduces a fundamentally different approach to absolute assessment. Rather than replacing rubrics, Adaptive Comparative Judgement (ACJ) works alongside them to refine how they're applied and to make assessment more empirically grounded.
Think of rubrics and comparative judgment as answering two different questions:
- Rubrics answer: "What should matter in this assessment?" They express the curriculum designer's intentions. They set out criteria that align with learning outcomes, disciplinary standards, and institutional values.
- Comparative judgment answers: "What actually matters when experts assess this work?" By having multiple assessors compare pairs of student submissions and choose which is better, you generate data about how those criteria are really being interpreted and applied.
In many cases, the two align perfectly. But in others, they don't. A rubric might emphasise "creativity and originality," but when educators actually compare student work, they might consistently favour submissions that also show methodological rigour. Or a rubric criterion about "clarity of communication" might prove less discriminating than anticipated: most submissions either have it or don't, and it's not where excellent work truly separates from good work.
RM Compare gives institutions an empirical window into these gaps. After running a comparative judgment session, you can analyse the patterns: Which criteria were frequently mentioned in assessor comments? Which criteria seemed to drive the final rankings? Were there criteria that sounded important but weren't actually predictive of perceived quality? This is where rubric calibration becomes evidence-based rather than purely aspirational.
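To make that kind of analysis concrete, here is a minimal sketch - in Python, on invented data, and making no claims about RM Compare's actual algorithm or export format - of how a session's pairwise decisions can be turned into a quality scale (here via a simple Bradley-Terry fit), which you can then set alongside criterion-level evidence from assessor comments.

```python
# Minimal sketch: turning hypothetical pairwise decisions into a quality scale
# with a Bradley-Terry model. Illustrative only - this is not RM Compare's
# algorithm, and the data below is invented.
from collections import defaultdict

# Each decision records (winner, loser) for one pairwise comparison.
decisions = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "C"),
    ("B", "C"), ("B", "D"), ("C", "D"),
]

items = sorted({s for pair in decisions for s in pair})
wins = defaultdict(int)
pair_counts = defaultdict(int)
for winner, loser in decisions:
    wins[winner] += 1
    pair_counts[frozenset((winner, loser))] += 1

# Standard minorise-maximise update for Bradley-Terry strengths.
strength = {i: 1.0 for i in items}
for _ in range(200):
    updated = {}
    for i in items:
        denom = sum(
            pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
            for j in items
            if j != i and frozenset((i, j)) in pair_counts
        )
        updated[i] = wins[i] / denom if denom else strength[i]
    total = sum(updated.values())
    strength = {i: v / total for i, v in updated.items()}

ranking = sorted(items, key=strength.get, reverse=True)
print("Scale (best first):", ranking)
print("Relative strengths:", {i: round(strength[i], 3) for i in ranking})
```

In practice you would feed in the real decision log and comment tags from a session; the point is simply that the ranking emerges from many small human judgements rather than being asserted by any one marker.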
Closing the Theory–Practice Gap in the AI Era
The AI crisis has made this problem acute. When assessment stakes are high and the temptation to use generative tools is real, rubrics alone are insufficient. A rubric might specify that submissions should show "original thinking" or "synthesis across sources," but an AI-generated response can mimic those qualities convincingly. The surface-level signals that rubrics rely on are increasingly unreliable.
Comparative judgment, by contrast, invites human judgment to do what it does best: recognise the subtle, holistic markers of authentic thought. When educators compare two pieces of work side by side, they often know which one rings true without being able to articulate why against a rubric. They notice tone, the coherence of an argument, the way evidence builds, and the way an idea develops. These are precisely the signals that are harder for AI to replicate convincingly.
By combining rubric frameworks with comparative judgment, institutions create a dual filter:
- Rubrics set the public standard - they communicate to students, families, and the broader community what the assessment values and what success looks like.
- Comparative judgment validates the standard - it checks whether those rubric criteria are actually predictive of professional consensus on quality, and whether assessors are interpreting them consistently.
The result is assessment that is both transparent and robust. Transparency comes from the rubric; robustness comes from the collective human judgment that has been structured and validated through comparative assessment.
Professional Dialogue, Not Just Consistency
Traditional rubric-based assessment can feel isolating. A teacher applies a rubric to a batch of essays alone, at night, clicking through checklist items. Calibration meetings happen annually, if at all. The result is that teachers rarely see what quality looks like through their colleagues' eyes, and assessment standards slowly drift.
Comparative judgment inverts this dynamic. The process is inherently social. Assessors compare work, see how their peers are judging, and engage in dialogue about what makes work excellent. The software captures their comments - "This shows deeper synthesis," "The evidence here is more specific" - creating a shared language around quality.
This dialogue is a form of professional development that rubrics alone cannot provide. When a teacher participates in a comparative judgment session, they're essentially asking: What do my colleagues think matters here? Where do I agree, and where do we differ? Over time, this builds shared understanding and collective efficacy in assessment.
Moreover, comparative judgment compresses the moderation cycle. Instead of marking individual scripts in isolation and then spending hours in calibration meetings trying to align decisions after the fact, a comparative judgment session produces reliable consensus much more quickly. Teachers see the pattern of decisions emerge in real time, and misalignments can be discussed as they arise rather than discovered months later in the data.
Trust and Transparency in the AI Age
As assessment practices adapt to an AI-saturated world, trust becomes currency. Students and families need to know that grades reflect genuine capability, not algorithmic noise. Employers need to trust that qualifications mean something. Institutions need to be able to defend their assessment decisions to regulators and to themselves.
Rubrics contribute to this trust by making assessment criteria explicit. But explicitness alone isn't enough: you also need to demonstrate that the criteria are being applied fairly and that the outcomes are reliable.
RM Compare provides an audit trail. Every decision is logged. The adaptive algorithm ensures that contentious submissions are seen by multiple assessors and that consensus is genuine rather than assumed. If a parent or regulator asks, "Why did this student receive this grade?", the institution can point not just to a rubric but to evidence: Multiple expert assessors compared this work against others, and here's how it ranked. Here are the comments they left. Here's the reasoning.
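To give a flavour of what "adaptive" can mean here - and this is a generic illustration of the idea behind adaptive comparative judgement, not a description of RM Compare's implementation - one common heuristic is to prioritise pairs whose current quality estimates sit close together, since those are exactly the contentious comparisons where an extra judgement carries the most information.

```python
# Illustrative adaptive-pairing heuristic (an assumption-level sketch of the
# general ACJ idea, not RM Compare's actual algorithm): prefer unjudged pairs
# whose current quality estimates are close, breaking ties towards items that
# have been compared less often.
from itertools import combinations

def next_pair(strength, seen_counts, judged_pairs):
    """Pick the most informative unjudged pair.

    strength     -- dict item -> current quality estimate
    seen_counts  -- dict item -> number of comparisons so far
    judged_pairs -- set of frozensets already shown to an assessor
    """
    candidates = [
        (a, b) for a, b in combinations(strength, 2)
        if frozenset((a, b)) not in judged_pairs
    ]
    if not candidates:
        return None
    # Small strength gap => contentious pair; low exposure => under-compared.
    return min(
        candidates,
        key=lambda p: (abs(strength[p[0]] - strength[p[1]]),
                       seen_counts[p[0]] + seen_counts[p[1]]),
    )

# Hypothetical mid-session state.
strength = {"A": 0.62, "B": 0.58, "C": 0.31, "D": 0.12}
seen = {"A": 3, "B": 2, "C": 3, "D": 2}
judged = {frozenset(("A", "C")), frozenset(("B", "D"))}
print(next_pair(strength, seen, judged))   # ('A', 'B') - the closest estimates
```

A production system also has to balance assessor workload and decide when the ranking is reliable enough to stop; the sketch only shows the core pairing idea. Whatever heuristic is used, every one of those logged decisions feeds the audit trail described above.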
This transparency is especially valuable in an era of AI scepticism. Students and families are increasingly aware that AI can fake surface features of quality. By showing that assessment has involved human judgment - real people making real comparisons and justifying their choices - institutions signal that they're looking for genuine capability, not just textual polish.
Practical Pathways: Making Rubrics and Comparative Judgment Work Together
So how do institutions actually implement this complementary approach? There are several pragmatic models:
Rubric-first, judgment-second. Design your rubric carefully, aligned with curriculum and learning outcomes. Then, gather a representative sample of student submissions and run an RM Compare session to check alignment. Do the rubric criteria predict the rankings that comparative judgment produces? If a rubric criterion doesn't correlate with how experts actually judge quality, that's data worth acting on. You might refine the rubric, add descriptors, or deprioritise that criterion in feedback while still using it in calibration.
Blended validation. Some institutions use comparative judgment in parallel with rubric marking. Assessors apply the rubric in the usual way, generating a score. They also participate in comparative judgment. Then you can correlate: Do submissions that rank highly in comparative judgment also score well on the rubric? Are there systematic biases - for example, does the rubric consistently overvalue certain criteria relative to expert perception? This reveals where training or rubric refinement is needed (a minimal correlation sketch follows these four models).
Benchmarking and exemplification. Use RM Compare to identify exemplars at different quality levels, then use those exemplars to sharpen rubric descriptors. Instead of writing abstract language about what "excellent" looks like, you can show the actual student work that the cohort collectively identified as excellent, good, or developing. This makes rubrics lived and real, not theoretical.
High-stakes moderation. For subjects or qualifications where stakes are high, use rubrics for feedback and classroom assessment, but use RM Compare for final moderation and grading decisions. This preserves the pedagogical value of rubrics while ensuring that summative judgements have the reliability and fairness that comparative judgment provides.
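For the blended-validation model above, the correlation check can be very simple. The sketch below uses Python with invented scores and hypothetical criterion names - not RM Compare's export format - to compute a Spearman correlation between each rubric criterion and the comparative-judgement rank, flagging criteria that barely predict perceived quality.

```python
# Sketch of a blended-validation check on invented data: do rubric criterion
# scores predict the comparative-judgement ranking? Criterion names, scores,
# and the -0.6 threshold are all hypothetical.
from scipy.stats import spearmanr

# Six submissions; cj_rank[i] is the rank from the comparative-judgement
# session (1 = judged best). Rubric scores are on a 1-4 scale.
cj_rank = [1, 2, 3, 4, 5, 6]
rubric_scores = {
    "creativity":    [4, 4, 3, 2, 2, 1],
    "rigour":        [4, 3, 4, 3, 2, 2],
    "communication": [3, 3, 3, 3, 2, 3],
}

for criterion, scores in rubric_scores.items():
    rho, _ = spearmanr(scores, cj_rank)
    # Strongly negative rho means higher scores go with better (lower) ranks.
    flag = "tracks expert judgement" if rho <= -0.6 else "worth reviewing"
    print(f"{criterion:14s} rho = {rho:+.2f}  ({flag})")
```

The same pattern scales up: correlate the total rubric score with rank to check overall alignment, then look criterion by criterion to see where training or rubric refinement would pay off.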
From Intention to Evidence: A New Maturity in Assessment
Educational assessment is maturing. We're moving beyond the assumption that marking (assigning a number to each individual submission) is the primary goal of assessment. Instead, we're asking harder questions: What does quality really look like? How do we know our rubrics describe reality? How do we ensure that all students are assessed fairly, regardless of who marked their work?
RM Compare isn't a replacement for rubrics. It's a maturation of how rubrics are used. It brings empirical validation to rubric design, builds professional consensus around standards, and creates an audit trail that institutions can defend with confidence. It harnesses the collective wisdom of experienced assessors and makes it visible and actionable.
As curricula evolve to address the AI era - emphasising applied thinking, creativity, and authentic performance - the combination of rubric-guided feedback and comparative judgment offers a way forward. Rubrics communicate what matters. Comparative judgment validates that we're assessing it fairly and reliably. Together, they create assessment that is both transparent and robust, both intentional and empirically grounded.
In a world where the traditional written essay no longer signals what it once did, this complementary approach matters more than ever. It helps institutions ask the right questions about what they value, see their assessment practice clearly, and restore trust in their grades. Not by abandoning rubrics, but by playing nicely with them - letting comparative judgment bring them to life.