The State of Learning by Evaluating

By Mark House

8th June 2026

There's a familiar moment in teaching that almost every educator recognises. You hand out the rubric, explain the assignment, and watch thirty faces look back at you with the same expression: a kind of polite blankness that says "I understand the words, but I don't yet understand what you mean." The criteria make sense in the abstract. What "good" looks like in practice is another matter entirely.

This gap — between knowing the criteria and genuinely understanding quality — is the problem that Learning by Evaluating (LbE) was designed to close.

The Idea Behind It

LbE is rooted in a deceptively simple insight from educational psychologist Royce Sadler, writing in 1987: verbal descriptions of quality are always, to some degree, fuzzy. A rubric that says "demonstrates sophisticated reasoning" cannot fully convey what sophisticated reasoning looks like in a student's essay. The standard, as Sadler put it, cannot be defined into existence. It has to be experienced.

The way students develop that experience, researchers have found, is through evaluating the work of others. Not reading it passively, but making active judgements — deciding which of two pieces is better, and articulating why. This is the foundation of Learning by Evaluating: students engage in structured comparative assessment of previous work, using a method called Adaptive Comparative Judgement (ACJ), before they attempt the same assignment themselves.

ACJ works by presenting students with pairs of artefacts and asking a simple question: which is better? An adaptive algorithm then refines subsequent pairings, presenting increasingly close comparisons as the session progresses. Students aren't marking against a scale — they're making holistic judgements, then writing a short rationale for each decision. The process was originally developed as a reliable assessment tool, but researchers noticed something interesting when students were placed in the judge's seat: they learned. And they learned faster, and with more depth, than students who had simply read the criteria.
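
The adaptive pairing idea can be sketched in a few lines of Python. This is a minimal illustration only: it assumes an Elo-style score update and a "closest unseen pair" selection rule, neither of which is claimed to be the algorithm used by any particular ACJ tool. The names `K`, `expected_win`, `update_scores`, and `next_pair` are hypothetical.

```python
import itertools
import math
from typing import Optional

K = 0.5  # hypothetical step size for score updates (an assumption)

def expected_win(score_a: float, score_b: float) -> float:
    """Logistic probability that A is judged better than B."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

def update_scores(scores: dict, winner: str, loser: str) -> None:
    """Nudge both scores toward the observed judgement (Elo-style)."""
    surprise = 1.0 - expected_win(scores[winner], scores[loser])
    scores[winner] += K * surprise
    scores[loser] -= K * surprise

def next_pair(scores: dict, seen: set) -> Optional[tuple]:
    """Return the not-yet-judged pair whose current scores are closest."""
    candidates = [
        pair for pair in itertools.combinations(sorted(scores), 2)
        if pair not in seen
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: abs(scores[p[0]] - scores[p[1]]))

# Example session with three artefacts, all starting at the same score.
scores = {"A": 0.0, "B": 0.0, "C": 0.0}
seen = set()
pair = next_pair(scores, seen)
seen.add(pair)
update_scores(scores, winner=pair[0], loser=pair[1])
```

As judgements accumulate, the scores spread out and the selection rule favours pairs that are hard to separate, which is the "increasingly close comparisons" behaviour described above.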

What a Decade of Research Has Found

The evidence base for LbE has grown substantially over the past ten years, and the core finding has remained consistent across different subjects, age groups, and countries. Students who evaluate peer work before starting their own produce higher quality output than those who don't. In one carefully controlled study, seven of the top ten student submissions came from the group that had completed an LbE session beforehand. A longitudinal study found those gains hadn't faded a full year later.

What explains this? A systematic review of 33 studies identified nine distinct things that happen when students engage with peer exemplars. Among them: students gain clarity about what the task actually requires. They focus their attention on the things that matter rather than the things that are easy to measure. They feel more confident, and less anxious about the blank page, because they've seen that the work is achievable. They reflect, often without prompting, on their own current capabilities relative to what they've just seen. And they raise their own ambitions, using the exemplars as benchmarks to measure and improve against.

Perhaps the most striking research finding for classroom practitioners came from a 2023 study involving 468 students, randomly assigned to evaluate either high-quality, low-quality, or mixed-quality examples through RMCompare before completing the same assignment. The researchers hypothesised that exposure to high-quality work would push students to perform better. It didn't — at least not measurably. There was no statistically significant difference in the quality of what students produced across the three groups. The act of making comparisons, it turns out, is the dominant mechanism. What you compare matters less than the fact that you're comparing at all.

That said, quality does matter in a subtler way. Students shown only high-quality work sometimes labelled genuinely excellent examples as merely average, because they had no frame of reference for the range of quality that exists. The researchers called this quality conditioning: a recalibration of internal standards caused by a too-narrow sample. And students shown only low-quality work were, perhaps unsurprisingly, less able to articulate what improvement looks like. A range of exemplars, avoiding the extremes, seems to offer the richest evaluative environment.

The Comment Box Is a Window Into Learning

Two studies took a different approach and looked closely at what students write when they justify their comparison decisions — the short rationale in the comment box that most ACJ tools require. What they found should give any teacher pause.

Nearly half of student comments were incomplete. These contained either an observation about the work (evidence) or a general reason for the preference (reasoning), but not both together. Only around 56% of comments demonstrated the full Claim-Evidence-Reasoning structure that characterises genuine analytical thinking. Many students could tell you which piece was better; far fewer could tell you why that mattered for their own upcoming work.

There was also a tendency toward what the researchers called evaluative fixation — students latching onto a single dimension of quality and applying it across every comparison. One student, analysing six pairs of Point of View statements, commented exclusively on whether the stakeholder was clearly identified in every single one, ignoring structure, insight, and clarity of need. Their comparisons weren't wrong, exactly, but they were narrow — and narrow attention during LbE likely produces narrow learning.
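
For a teacher who wants to look for these two patterns in their own class, a rough tally is easy to sketch. The Python below assumes the comments have already been hand-coded by the teacher: each one is tagged for whether it contains evidence and reasoning, and for the quality dimension it addresses. The comparison decision itself is treated as the claim. The `CodedComment` type, the field names, and the `min_comments` threshold are all hypothetical, and this is not the coding scheme used in the studies.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class CodedComment:
    student: str
    has_evidence: bool   # comment points to something observed in the work
    has_reasoning: bool  # comment explains why that observation matters
    dimension: str       # hand-assigned label, e.g. "stakeholder" or "structure"

def complete_share(comments: list) -> float:
    """Fraction of comments that contain both evidence and reasoning."""
    if not comments:
        return 0.0
    complete = sum(c.has_evidence and c.has_reasoning for c in comments)
    return complete / len(comments)

def fixated_students(comments: list, min_comments: int = 3) -> list:
    """Students whose comments (at least min_comments) all address one dimension."""
    by_student = {}
    for c in comments:
        by_student.setdefault(c.student, Counter())[c.dimension] += 1
    return sorted(
        student for student, dims in by_student.items()
        if sum(dims.values()) >= min_comments and len(dims) == 1
    )
```

A low `complete_share` suggests the class needs more scaffolding for its rationales, while any name returned by `fixated_students` is a prompt for a conversation about the dimensions that student has not yet considered.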

This matters because it reframes what teachers should be thinking about during an LbE session. The question isn't just "did they complete the comparisons?" It's "are they thinking richly about what they're seeing?"

LbE Isn't Just a Starter Activity

Perhaps the most practically significant insight from recent research is the finding that LbE works not only as a pre-task primer but at every stage of the learning process. Interviews with experienced teachers who had integrated LbE across multiple years of teaching confirmed something that practitioners had been discovering informally: a mid-project comparison session, in which students evaluate their own in-progress work alongside peers', often drives more meaningful iteration than feedback given at the end.

At the start, LbE gives students a sense of the destination. In the middle of a project, it gives them the vocabulary and the reference point to identify exactly where they currently are — and what the next step looks like. At the end, it supports structured reflection that goes beyond "I could have done more research" into genuinely specific analysis of craft and quality.

The teachers interviewed in that study also raised something important about what happens after the comparison session — what the researchers framed as the debrief. Almost without exception, they identified the teacher-led discussion following LbE as the moment when learning consolidated. Without it, students had impressions they couldn't fully articulate. With it, those impressions became transferable principles. One teacher described the moment a student's casual observation about adjustable straps in a design comparison became the entry point for a class discussion on universal design principles. LbE had surfaced the insight; the debrief gave it meaning.

Choosing What Students Compare

One of the most common questions from teachers new to LbE is also the most practical: what should I actually upload? The research offers clearer guidance here than is sometimes appreciated.

Authentic peer work — produced by real students on the same or a similar task — consistently outperforms teacher-created or professionally produced examples in motivational terms. Students respond to seeing that people like them produced work of this quality. The message is implicit but powerful: this is achievable for you too.

The range of quality matters, but probably not in the way you'd expect. Research comparing groups exposed to high-quality-only, low-quality-only, and mixed exemplars found no significant difference in the quality of what students subsequently produced. The comparison process itself drives the learning. But quality range does shape the experience of that learning. Students shown only high-quality examples sometimes struggled to articulate what mediocre work looks like — their internal calibration was skewed by the narrow band they'd seen. Including a genuine range, from passing through to excellent, produces the richest evaluative thinking. Failing work is the exception: multiple studies found that genuinely poor exemplars demotivate students and pull their sense of what's "normal" in the wrong direction.

Variety of approach matters as much as variety of quality. When students see multiple different ways of tackling the same problem — not just better and worse versions of the same approach — they are more likely to explore creatively in their own work rather than converging on an imitation of what they've seen. This directly addresses the concern, common among teachers new to LbE, that showing exemplars will suppress originality. The evidence says the opposite is true, provided the exemplar set is genuinely varied.

On the question of how many comparisons to include, the research is less definitive — no study has systematically optimised session length. What the evidence does suggest is that somewhere between five and eight pairs per student produces meaningful learning within a workable timeframe, and that engagement fatigue sets in when sessions are too long or too repetitive. The practical implication is that a tightly curated set of eight to twelve artefacts — generating five to eight pairwise comparisons per student — is a reasonable starting point for most classroom contexts.
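
The arithmetic behind that suggestion is worth seeing. A set of n artefacts contains n(n-1)/2 possible pairs, so a student making five to eight comparisons sees only a fraction of them. The Python sketch below counts the pairs and draws a session's worth at random. The random draw is an assumption made for simplicity (an ACJ tool would choose pairs adaptively), and the function names are hypothetical.

```python
import itertools
import math
import random

def possible_pairs(n_artefacts: int) -> int:
    """Number of distinct pairs in a set of artefacts: n choose 2."""
    return math.comb(n_artefacts, 2)

def sample_session(artefacts: list, n_comparisons: int, seed=None) -> list:
    """Draw n_comparisons distinct pairs at random for one student."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(artefacts, 2))
    return rng.sample(all_pairs, n_comparisons)

print(possible_pairs(8))   # 28 possible pairs from eight artefacts
print(possible_pairs(12))  # 66 possible pairs from twelve artefacts
```

With eight artefacts, six comparisons cover 6 of the 28 possible pairs, so different students will usually see different combinations, which gives the debrief more to draw on.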

What This Means in the Classroom

The research is clear enough that the question is no longer whether LbE works. The more useful question is what separates a well-implemented session from a superficial one — and the answers are surprisingly practical.

The holistic statement that frames each comparison matters more than it might seem. "Which is the better essay?" and "Which better demonstrates the writer's awareness of their audience?" are both valid prompts, but they produce quite different thinking. Targeted prompts that foreground specific quality dimensions tend to produce richer, more transferable reasoning. Varying the prompt across different sessions — across a unit or a year — also reduces the risk of engagement fatigue, which multiple teachers flagged as a real challenge when LbE was used repeatedly in the same format.

The comment requirement deserves more thought than it typically gets. If students are going to write incomplete reasoning half the time, then scaffolding matters. Sentence starters — "I chose this one because... specifically I noticed... this would help a reader/user because..." — can prompt the kind of full argumentation that produces genuine learning rather than a perfunctory justification.

And the concern that showing exemplars will lead students to copy rather than create turns out to be largely unfounded. Research consistently found that students shown a range of approaches felt more creatively liberated, not less. One student, reflecting on having seen two radically different design solutions side by side, put it memorably: "That doesn't say to me, I'm going to pick one and copy it. It says — there are so many ideas. What's mine?"

The Bigger Picture

LbE represents something more than a useful classroom technique. It reflects a shift in how learning and assessment are understood — away from the model in which students produce work, teachers evaluate it, and the cycle ends, toward one in which evaluation is a skill students develop, practised repeatedly throughout their education rather than experienced only as a verdict at the end.

The act of judging — of holding two things up against each other and asking which is better and why — is precisely what experts do in every field. Editors, engineers, designers, researchers, clinicians: they all possess a finely calibrated sense of quality that they apply constantly. LbE is one of the most direct ways education has found to develop that capacity deliberately, rather than assuming it will emerge on its own.

The evidence for it is robust. The tools to implement it are accessible and field-tested. The only remaining question is whether teachers feel equipped to use it well — and whether they have the support to move beyond the comparison session itself, into the debrief, the mid-project checkpoint, and the sustained integration that turns a technique into a pedagogy.

The research

W. Lee, N. Mentzer, A. Jackson, S. Bartholomew, and A. Clevenger, “Learning by Evaluating in Engineering Design Classrooms: A 5E Instructional Model Perspective from Teachers,” J. Technol. Educ., vol. 37, no. 1, pp. 94–129, 2025
W. Lee, N. Mentzer, A. Jackson, and S. Bartholomew, “A Thematic Analysis of High School Students’ Scientific Argumentation of what Constitutes a ‘Better’ Engineering Design,” Int. J. Technol. Des. Educ., 2025
W. Lee, N. Mentzer, A. Jackson, S. Bartholomew, and A. Clevenger, “Defining and evaluating argumentation quality in the context of design thinking: Using high school students’ design critiques from foundational engineering courses,” Des. Technol. Educ. Int. J., vol. 29, no. 3, 2024
S. Bartholomew, J. Yauney, N. Mentzer, and S. Thorne, “Investigating the Impacts of Differentiated Stimulus Materials in a Learning by Evaluating Activity,” Int. J. Technol. Des. Educ., vol. 34, 2024
S. Thorne, N. Mentzer, G. Strimel, S. Bartholomew, and J. Ware, “A Systematic Literature Review of Student Evaluation of Peer Exemplars and Implications for Design, Technology, and Engineering Learning,” Int. J. Technol. Des. Educ., 2024
S. Thorne, N. Mentzer, G. Strimel, S. Bartholomew, and J. Ware, “Learning by Evaluating: An Exploration of Optimizing Design-Based Instruction,” J. Technol. Educ., vol. 35, no. 2, pp. 53–80, 2024
S. Thorne, N. Mentzer, S. Bartholomew, G. Strimel, and J. Ware, “Learning by Evaluating as an Interview Primer to Inform Design Thinking,” Int. J. Technol. Des. Educ., 2024
N. Mentzer, W. Lee, A. Jackson, and S. Bartholomew, “Learning by Evaluating (LbE): promoting meaningful reasoning in the context of engineering design thinking using Adaptive Comparative Judgment (ACJ),” Int. J. Technol. Des. Educ., Oct. 2023
S. Bartholomew, N. Mentzer, and A. Jackson, “Lessons From Dilbert: Clarifying Design Expectations,” Technol. Eng. Educ., vol. 1, no. 1, pp. 7–13, 2023
N. Mentzer, W. Lee, and S. Bartholomew, “Examining the validity of adaptive comparative judgment for peer evaluation in a design thinking course,” Front. Educ., vol. 6, 2021
