Beyond Formula: The New Frontier for Assessment in an AI World

This blog follows up on our earlier reflections on curriculum change and assessment backwash, bringing new insights and urgency to the conversation in the era of AI.
The Washback Risk: What Happens When AI Rewards Formulaic Thinking?
As AI-powered marking becomes more common in education, we must confront a critical risk: the “washback” effect, where the method of assessment shapes how and what students learn. Automated essay scoring powered by AI often rewards essays that adhere closely to expected templates—clear structure, formulaic vocabulary, and standard organization—regardless of whether content is genuinely insightful or original.
This risk isn’t hypothetical. Recent research confirms that AI systems, while remarkably efficient at processing large volumes of essays, fall short when it comes to appreciating creativity, novel reasoning, or unconventional approaches. They’re trained to spot patterns from vast troves of typical responses. As a result, they reward conformity—and can penalise precisely the kind of original thought our future economy and society need most.
What the Research Reveals
A wave of recent studies underscores the limitations of AI assessment:
- Surface-Level Marking: AI grading consistently rewards structure, grammar, and conformity over depth, originality, or creative argumentation. Essays that mimic high-scoring responses, even when the content is shallow, are often rated more highly than risk-taking, thoughtful work.
- Penalising Originality: Empirical studies show that AI tends to penalise or undervalue truly original or creative responses, especially when those responses deviate from "normed" patterns found in training data.
- Human vs AI Judgement: Direct comparisons in the last two years demonstrate that human markers recognise and reward nuance, complexity, and originality to a much greater degree. Newer "reasoning" AI models narrow the gap slightly, but the risk remains pronounced—particularly for tasks requiring genuine reasoning and analysis.
Lessons from Apple's Landmark Study
This challenge was powerfully highlighted by Apple’s recent study, The Illusion of Thinking. The research showed that even the most advanced AI models can easily be misled by complexity or unfamiliarity. While standard language models excelled at simple, template-based tasks, they broke down entirely on problems requiring deeper reasoning or adaptation. Even explicit instructions and examples failed to elevate their performance on genuinely complex challenges. In short, AI’s apparent “reasoning” often proved to be little more than pattern-matching, not true understanding.
This raises a vital question for educators: If AI can’t recognise or reward real thinking, what happens to student motivation, curriculum, and pedagogy?
Washback: Implications for Curriculum and Pedagogy
When assessment tools reward what is easiest for an algorithm to detect—rather than what matters most for learning or life—the effects ripple outward:
- Narrowed Curriculum: Teachers and learners, knowingly or not, adapt to the test. Lessons pivot towards producing what the assessment will reward: formulaic writing and safe, conventional answers.
- Stifled Creativity: The message to students becomes clear: “Play the game, don’t take risks.” Over time, this erodes both engagement and the broader competencies—creativity, critical thinking, problem-solving—needed in an AI-driven world.
- Missed Skills for an AI World: Paradoxically, by leaning into AI’s current strengths, we neglect developing the very skills that make us resilient and relevant. To flourish amid automation and rapid change, learners must cultivate the abilities that AI cannot (yet) replicate: genuine reasoning, creativity, ethical judgement, and adaptability.
If these washback effects are allowed to persist unchallenged, we risk preparing young people for the needs of 2010, not 2035.

The RM Compare Solution: Assessment for Complexity and Creativity
So how can we safeguard against the homogenising effect of AI assessment, while still harnessing its efficiency? The answer lies in approaches that blend human insight and technology—like RM Compare:
- Adaptive Comparative Judgement: Rather than marking against a rigid rubric, RM Compare enables judges (human or hybrid) to compare pairs of student work and decide which best meets the intended educational outcomes. This comparative process values the whole performance, capturing qualities like creativity, originality, and nuance that algorithms alone often miss (a simple illustration of how pairwise judgements become a rank order follows this list).
- Authentic Assessment: By reflecting real comparative judgement, the platform helps ensure that what is assessed is what truly matters—supporting positive washback on curriculum and teaching.
- Fairness and Consistency: RM Compare leverages technology to reduce traditional marking biases, while still preserving the human capacity to recognise and reward the exceptional, the original, and the insightful.
- Building Skills for the AI Age: Crucially, RM Compare supports the cultivation—and recognition—of the complex skills young people need to thrive, not just survive, in a world reshaped by artificial intelligence.
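To make the mechanics concrete, here is a minimal, purely illustrative sketch of how a batch of pairwise “which of these two is better?” decisions can be turned into a rank order, using a simple Bradley-Terry style estimate. This is not RM Compare’s implementation or API; the function name, sample data, and iteration count below are assumptions chosen for clarity.

```python
from collections import defaultdict

def bradley_terry(items, judgements, iterations=100):
    """Estimate a relative quality score for each item from pairwise judgements.

    judgements: list of (winner, loser) tuples, one per judging decision.
    Returns a dict mapping item -> strength (higher means more often preferred).
    """
    wins = defaultdict(int)          # comparisons won by each item
    pair_counts = defaultdict(int)   # how many times each pair was compared
    for winner, loser in judgements:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {item: 1.0 for item in items}  # start everyone equal

    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # Standard iterative update; items with no comparisons keep their score.
            updated[i] = wins[i] / denom if denom else strength[i]
        # Rescale so the average strength stays at 1.0 between iterations.
        total = sum(updated.values())
        strength = {i: s * len(items) / total for i, s in updated.items()}

    return strength

# Five judgements over three pieces of work: A is consistently preferred.
judgements = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]
ranking = bradley_terry(["A", "B", "C"], judgements)
print(sorted(ranking.items(), key=lambda kv: kv[1], reverse=True))
```

In adaptive systems the next pair shown to a judge is also chosen to add as much information as possible, but the core idea is the same: many small comparative decisions are aggregated into a reliable scale, without ever asking a judge to apply a rubric.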
RM Assessment
RM Assessment, as the parent organisation, is actively embracing AI to enhance assessment by delivering reliable, consistent, and timely marking for large-scale qualifications. At the same time, RM Assessment recognises the vital importance of human judgement and the need to nurture skills beyond what AI alone can identify. That’s why they are investing in a broader ecosystem of tools, such as RM Compare, which harnesses adaptive comparative judgement to value the depth, originality, and creativity in learner responses. This balanced approach ensures that while AI supports efficiency and fairness at scale, other solutions are being developed to preserve and amplify the rich, nuanced qualities that are essential to both learning and teaching in the AI era.
A Call to Action
As educators, school leaders, and policymakers, we face a pivotal choice: allow our assessment systems to narrow the horizon of what’s possible, or reimagine them to unlock the full breadth of human potential—in partnership with, but not dictated by, AI.
The future demands more than formula. Let’s ensure our learners are prepared.