The research behind Learning by Evaluating: Why RM Compare | ⏱️NOW works

By Mark House

13th May 2026

In the last post, we introduced Assessment as Learning and the idea of Guild Knowledge. This is the tacit, experience‑based ability to recognise quality that experts build over time. RM Compare | ⏱️NOW gives people a short, structured way to test and develop that ability by estimating quality, making comparisons, and seeing how accurate their judgement is.

This post looks at the research behind that idea. Over the last 35 years, studies have built a remarkably consistent evidence base: a lack of calibrated Guild Knowledge is a problem, a short Learning by Evaluating (LbE) intervention can improve it, and all ability levels benefit.

when you know, you know

Sadler: the Guild Knowledge problem

The modern story starts with D. Royce Sadler’s 1989 paper, Formative assessment and the design of instructional systems. Sadler argued that for students to improve, three conditions must be met simultaneously: they must hold a concept of the standard being aimed for, be able to compare their own work against that standard, and know how to close the gap.

He noted that expert teachers carry standards “largely in unarticulated form, inside their heads as tacit knowledge,” and that when they compare and discuss student work together this shared understanding “constitutes a form of guild knowledge.” In other words, experts know good work when they see it, but often cannot fully articulate why in a way that novices can immediately use.

Sadler’s key insight was that learners cannot meaningfully improve if they do not themselves hold a reasonably accurate internal concept of quality. Rubrics, grades and teacher comments help, but they cannot on their own transfer tacit standards from experts into novices’ heads.

What does help, Sadler suggested, is structured exposure to a range of work and practice in making judgements about it. This is precisely the kind of experience that Learning by Evaluating is designed to provide.

Purdue: 20 minutes that change outcomes

Three decades later, Bartholomew and Mentzer tested this idea at scale at Purdue University. In their 2020 study Learning by Evaluating (LbE) through adaptive comparative judgment, they worked with 550 first‑year students on a design‑thinking course and ran a randomised experiment.

Half the students followed the usual instruction; the other half spent around 20 minutes using Adaptive Comparative Judgement to evaluate pairs of anonymised Point‑of‑View statements from a previous cohort before starting their own work. During this LbE activity, students saw real examples across a quality range and had to choose which was stronger each time, mirroring the “estimate then compare” process NOW uses.

The results were striking. Seven of the top ten performers on the subsequent assignment came from the LbE group, despite both groups receiving identical teaching. Across the entire cohort, students who had taken part in the LbE activity performed significantly better than the control group, and this improvement was seen across all ability levels, not just among high achievers.

Importantly, the intervention was small and focused. It was not an extra teaching unit, a detailed rubric workshop or a lengthy training programme. Students simply spent time evaluating and comparing work before producing their own. The authors concluded that Learning by Evaluating “turns the assessment process into a learning experience,” reinforcing Sadler’s claim that structured judgement practice builds internal standards.

Ireland: All boats rise

Parallel research in Ireland explored what happens when students themselves become the judges, using LbE as part of an ipsative approach that tracks improvement relative to each learner’s starting point.

In a study with 128 technology education students across four assignments, Seery, Buckley, Delahunty and Canty (2020) found that the whole cohort improved substantially over time. Mean scores rose from 54.9% on the first assignment to 76.6% on the fourth, and reliability of the ACJ judgements was exceptionally high (α between .965 and .974 across all assignments).
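
To make the size of that cohort-level change concrete, the rise in mean scores can be expressed either as a percentage-point gain or as a gain relative to the starting score. The short sketch below works through both using only the two means quoted above; the function name `gains` is ours, chosen for illustration, and is not taken from the study.

```python
def gains(first: float, last: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in %) between two mean scores."""
    absolute = last - first            # difference in percentage points
    relative = absolute / first * 100  # difference as a share of the starting score
    return absolute, relative


# Mean scores reported for the first and fourth assignments.
absolute, relative = gains(54.9, 76.6)
print(f"{absolute:.1f} percentage points, {relative:.1f}% relative to the first assignment")
```

On the quoted means this gives a rise of about 21.7 percentage points, or roughly 39.5% relative to the first assignment.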

The most striking finding concerned lower‑achieving students. Learners in the lowest quartile (Q1) improved their scores by 40.75% between the first and last assignments. This was the largest absolute gain of any ability group. Higher‑achieving students also improved, though their scores shifted less dramatically in percentage terms, a pattern the authors interpret as partly reflecting rising expectations and more demanding tasks.

Two implications stand out:

All boats rise: Learning by Evaluating produced measurable gains across the whole ability range, not just among students who were already close to the standard.
Judgement develops faster than production skill: even weaker students could reliably identify higher‑quality work when acting as judges, before they themselves were consistently able to produce work at that level.

This supports the idea that Guild Knowledge can be developed before, and independently of, full performance capability, and that structured judgement practice is an efficient way to do so.

Learning by Evaluating across the design process

A more recent paper by Thorne, Mentzer, Strimel, Bartholomew and Ware (2024) extended this work into K‑12 classrooms as part of a National Science Foundation‑funded design‑based research project. Working with five teachers and 414 students in a large US school district, the team explored how LbE could be integrated into a longer design process.

They found that Learning by Evaluating was used at eight of twelve steps in the design cycle, from early ideation through prototyping to communicating solutions. Teachers used LbE in both convergent ways (narrowing in on what makes a solution good) and divergent ways (broadening thinking by exposing students to a range of approaches), a dual use of ACJ not previously described in the literature.

Crucially for the Guild Knowledge story, the authors state explicitly that “the use of exemplars facilitates the transfer and application of tacit knowledge regarding criteria, standards, and the nature and quality of work,” citing Sadler’s original argument. They also note that LbE “potentially reduces time to task completion” and can reduce student anxiety about open‑ended tasks by clarifying expectations through examples rather than abstract descriptions.

Teacher framing emerged as important. How teachers introduced the holistic question (for example, emphasising function vs. aesthetics) influenced what students attended to when judging, reinforcing the need for carefully designed standards and prompts. Nevertheless, across five different classrooms and teaching styles, the approach proved robust and adaptable.

Teacher preferences and differentiated examples

Other studies from the same NSF project explored the role of teacher preferences and the design of example sets in more detail.

Bartholomew and Yauney examined the impact of differentiated stimulus materials: in simple terms, whether the quality range of examples shown to students affects learning. They found that when learners evaluated a broader range of work (including both strong and weak examples), they developed more refined judgement than when they saw only mid‑range or high‑quality work. This supports the idea that standards used in tools like RM Compare | ⏱️NOW should intentionally span the full quality spectrum.

In a separate study on teacher preferences, Bartholomew, Barnum, Jackson, Mentzer and Allen looked at five teachers in the same course, using identical LbE activities but with different wording and emphasis in their holistic questions. They found that teacher framing did influence what students focused on, but that LbE still produced broadly consistent results across classrooms, suggesting the core mechanism is robust even when local pedagogy varies.

Taken together, these findings deepen the practical guidance: good LbE design involves diverse, well‑chosen exemplars and carefully framed questions, but the underlying effect - improvement in judgement - appears stable across settings.

ACJ in practice: fast, intuitive expert judgement

While the studies above focus on learners, research by Buckley and colleagues has examined how professional judges behave when using Adaptive Comparative Judgement in real‑world contexts.

In a 2025 study with 20 industrial designers evaluating over 200 anonymised design portfolios from primary and secondary students, Buckley and Zhu found that difficult judgements did not take significantly longer than easy ones, and that judges maintained consistent speed and reliability across extended sessions. Average decision times were around 50–76 seconds per comparison, and reliability coefficients were high (often above 0.8).

This matters because it shows how Guild Knowledge expresses itself in practice: experienced practitioners make fast, holistic, intuitive judgements that are nevertheless highly reliable, rather than slow, checklist‑driven analyses. LbE is essentially a way of helping more people move towards that expert‑like pattern of judgement, earlier in their development.

Three implications, and how ⏱️NOW fits

Across all of this work, three implications stand out.

1. A lack of Guild Knowledge is a real, measurable problem

Students and practitioners who lack calibrated internal standards produce weaker work, make less reliable judgements, and find open‑ended tasks more anxiety‑inducing. Traditional approaches such as rubrics, lectures on criteria and one‑off exemplars can help, but they do not reliably transfer tacit standards from experts to novices.

2. A short Learning by Evaluating intervention makes a difference

The Purdue study showed that around 20 minutes of LbE before a task led to significant improvements in performance across a cohort of 550 students, with seven of the top ten performers coming from the LbE group. The Seery et al. study showed that with 8–9 judgements per session, students made substantial gains over just four assignments, with the lowest quartile improving by 40.75%.

Thorne et al. add that LbE can reduce time to task completion and help learners understand expectations more quickly across multiple stages of a design process. In other words, the gap between “uncalibrated” and “meaningfully calibrated” judgement can be narrowed in surprisingly short, focused sessions.

3. All boats rise

Both the Purdue and Irish studies found that LbE benefits learners across the ability range. At Purdue, the achievement uplift was seen at all performance levels; in the ipsative study, the greatest absolute gains were in the lowest quartile, while higher‑achieving students also improved and refined their judgement.

This is unusual. Many interventions disproportionately help those already near the standard. LbE, by contrast, appears to lift the entire cohort, which is exactly what organisations need when trying to raise the quality of judgement across whole teams, not just a few individuals.

Why this matters for RM Compare | ⏱️NOW

RM Compare | ⏱️NOW is designed to embody these research findings in the simplest possible experience.

It gives practitioners a short LbE session that can be completed in minutes on any device, with no login or setup.
It uses trusted standards built in 💻RM Compare | Studio, ensuring that learners are exposed to a genuine quality range rather than a narrow set of examples.
It provides immediate feedback on judgement accuracy, making tacit gaps visible and giving people a clear sense of where their Guild Knowledge is and how it is changing.
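
As a rough illustration of what "feedback on judgement accuracy" could mean, the sketch below scores a set of pairwise choices against a reference ordering. This is an assumption made for illustration only, not a description of how RM Compare calculates its feedback: the reference scores, the item labels and the function `judgement_accuracy` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical reference scores for four pieces of work (higher = better).
# These values are invented for illustration only.
REFERENCE: Dict[str, float] = {"A": 0.9, "B": 0.6, "C": 0.3, "D": 0.1}


def judgement_accuracy(reference: Dict[str, float],
                       choices: List[Tuple[str, str, str]]) -> float:
    """Share of pairwise choices that agree with the reference ordering.

    Each choice is (item_a, item_b, chosen), where `chosen` is the item the
    judge picked as stronger. Pairs with equal reference scores are skipped.
    """
    agree = 0
    counted = 0
    for item_a, item_b, chosen in choices:
        if reference[item_a] == reference[item_b]:
            continue  # no "right answer" for a tie, so do not count it
        better = item_a if reference[item_a] > reference[item_b] else item_b
        counted += 1
        if chosen == better:
            agree += 1
    return agree / counted if counted else 0.0


# A judge's four comparisons; the second one disagrees with the reference.
choices = [("A", "B", "A"), ("B", "C", "C"), ("C", "D", "C"), ("A", "D", "A")]
accuracy = judgement_accuracy(REFERENCE, choices)
print(f"{accuracy:.0%} of judgements matched the reference")
```

Here three of the four choices agree with the reference ordering, so the sketch reports 75%. A real tool would likely weight near-ties differently from clear-cut pairs, which this simple proportion does not attempt.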

The research suggests that when experiences like this are used thoughtfully and repeatedly, people become better at recognising quality, regardless of their starting point. RM Compare | ⏱️NOW is the fastest way into that process.

In the next post in this series, we will look at what this means for organisations: the costs of operating without Guild Knowledge, the risks of over‑relying on AI and rigid process, and the business case for investing in judgement development alongside traditional assessment.

If you want to see what Learning by Evaluating feels like in practice, you can try RM Compare | ⏱️NOW today.
