The OECD Just Mapped the Certification Problem. Here's the Solution.

The OECD's new report on upper secondary certification is 195 pages of exceptional comparative analysis. It maps 71 certificates across 38 education systems. It identifies why so many reform attempts stall. It names four principles that any credible certificate must balance (relevance, credibility, fairness, and manageability) and then shows, country by country, how every system struggles to achieve more than two or three of them at once.

It is one of the most thorough surveys of the assessment standardisation problem ever published.

And it doesn't solve it.

That's not a criticism. The report is honest about this. Its "Future Work" section lists marking, moderation and grading systems as the primary unresolved challenge, and calls for further research into how systems can reliably assess the complex, authentic work that modern curricula demand. The authors know exactly where the gap is. I would argue that what we don't yet have is a different cognitive model for filling it.


The Trap Every System Is In

The OECD's analysis surfaces the same structural tension in country after country. The assessment methods that produce the most valid picture of student capability (teacher assessment of portfolios, projects, extended writing, practical performances) are also the hardest to standardise. The methods that are easiest to standardise (timed written examinations under controlled conditions) systematically underassess the skills modern economies need most.

Every country is running some variation of the same experiment: can you add enough external scaffolding around teacher judgement to make it reliable without destroying its validity? The answer, repeatedly, is no. Or at least, not without enormous cost.

Sweden tried. After years of growing divergence between teacher grades and national standards, a 2025 government investigation proposed anchoring all student grades to external exam results by 2030. Credibility restored, perhaps. But what happens to the teacher assessment of complex work that exams cannot reach?

Lithuania tried. It removed school-level examinations entirely and required all certification to be set and marked at state level. Clean. Standardised. And now, as the report notes, a system in which only 1.7% of vocational graduates progressed to higher education, compared with 57.8% of their general education peers, because the assessment was calibrated for one kind of learning.

England tried, in 2020, to standardise calculated grades statistically when exams were cancelled. The algorithm was technically defensible. It revised 39% of school-submitted grades downward. The public outcry forced the government to abandon it within days. The problem wasn't the statistics. It was the opacity: parents and students could not understand how an absolute grade, assigned by a teacher who knew the student, had been transformed by a formula into something different.

The OECD diagnoses all of these cases accurately. What it doesn't do is ask whether the problem lies not in the execution of absolute judgement but in absolute judgement itself.

The Cognitive Model Nobody Is Questioning

Every assessment system described in the report, whether teacher-based, externally marked, or statistically moderated, rests on the same assumption: that the core act of assessment is a human being assigning an absolute score or grade to a piece of student work.

This is a harder cognitive task than it appears. Assigning a reliable absolute grade to a complex essay, design project, or performance requires the assessor to hold a calibrated mental standard, apply it consistently across many pieces of work, and produce scores that are comparable to those of colleagues they may never have met, working in schools they may never have visited.

Research in psychometrics has known for nearly a century (since Thurstone's Law of Comparative Judgement in 1927) that human beings are far more reliable at a simpler task: deciding which of two things is better. Not by how much. Not on which specific criteria. Just: which one?
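
For readers who want the formal statement (my summary, not the report's), both Thurstone's original model and the logistic Bradley-Terry form commonly fitted in comparative judgement research assume that each piece of work has a latent quality, and that the probability of winning any comparison depends only on the difference in quality:

```latex
% Thurstone (1927), Case V: Phi is the standard normal CDF
% (discriminal dispersions absorbed into the scale)
P(A \text{ judged better than } B) = \Phi(\theta_A - \theta_B)

% Bradley-Terry / logistic form, commonly fitted in CJ research
P(A \text{ judged better than } B) = \frac{e^{\theta_A - \theta_B}}{1 + e^{\theta_A - \theta_B}}
```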

This is the cognitive model that Comparative Judgement brings to assessment. And the evidence for its superiority over absolute marking, in the specific domain of complex, holistic work, is now substantial.

What Changes When You Reverse the Question

When you ask a teacher not "what grade does this essay deserve?" but "which of these two essays is stronger?", several things happen simultaneously.

The reliability of the resulting judgements increases dramatically. Comparative judgement maps onto a natural human perceptual capacity (the same one that lets us say confidently that this hill is steeper than that one, without needing to measure either). Individual judges are therefore far more consistent, and the aggregate of many pairwise comparisons produces a rank order anchored in the collective discrimination of the whole judging community rather than in any single marker's calibration.
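
To make the aggregation concrete, here is a minimal sketch, in Python, of how a pile of recorded "which is better?" decisions becomes a single scaled rank order. It fits a Bradley-Terry model with the classic Zermelo/Ford iterative update; the script names and judgement data are hypothetical, and real comparative judgement platforms add standard errors, misfit checks and adaptive pairing on top of this core idea.

```python
"""Minimal sketch: turning pairwise "which is better?" decisions into a
rank order by fitting a Bradley-Terry model (Zermelo/Ford iteration).
Illustrative only; names and data below are hypothetical."""
from collections import defaultdict
import math


def fit_bradley_terry(judgements, iterations=200):
    """judgements: list of (winner, loser) pairs from individual judges.
    Assumes every script has at least one win and one loss; real systems
    handle degenerate cases with priors or adaptive pairing."""
    wins = defaultdict(int)          # total wins per script
    pair_counts = defaultdict(int)   # comparisons per unordered pair
    scripts = set()
    for winner, loser in judgements:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        scripts.update((winner, loser))

    strength = {s: 1.0 for s in scripts}
    for _ in range(iterations):
        updated = {}
        for i in scripts:
            # denominator built from every pairing that involves script i
            denom = sum(
                count / (strength[i] + strength[j])
                for pair, count in pair_counts.items() if i in pair
                for j in pair - {i}
            )
            updated[i] = wins[i] / denom
        mean = sum(updated.values()) / len(updated)   # fix the arbitrary scale
        strength = {s: v / mean for s, v in updated.items()}
    return strength


# hypothetical decisions from several judges across three scripts
judgements = [
    ("essay_A", "essay_B"), ("essay_A", "essay_C"), ("essay_B", "essay_C"),
    ("essay_C", "essay_B"), ("essay_B", "essay_A"), ("essay_A", "essay_C"),
]
for script, s in sorted(fit_bradley_terry(judgements).items(),
                        key=lambda kv: kv[1], reverse=True):
    print(f"{script}: quality {math.log(s):+.2f}")  # log-strengths give the scale
```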

The validity of what can be assessed expands. Comparative judgement is format-agnostic. Judges can compare essays, portfolios, design artefacts, recorded performances, research projects, practical demonstrations: pretty much anything that can be placed in front of a human expert. This means the full range of complex, authentic, higher-order work that the OECD report calls for can be assessed with high reliability, rather than being set aside in favour of more markable but less valid exam questions.

The standardisation problem changes its nature. In an absolute marking system, standardisation is the attempt to make many different people apply the same scale to the same work. It requires mark schemes, standardisation meetings, re-marking, statistical moderation. This is all the apparatus the OECD report catalogues and finds wanting. In a comparative judgement system, standardisation is an emergent property of aggregated pairwise decisions. There is no mark scheme to drift from. There is no grade boundary to set by committee. The rank order is anchored in the actual work, judged comparatively, by many people.

And the credibility argument changes too. The England 2020 crisis was fundamentally about opacity: a grade produced by a teacher who knew the student was transformed by an algorithm nobody could explain. A Comparative Judgement outcome is the opposite of opaque. Every pairwise decision is recorded. The statistical quality of the rank order is measurable and visible. The process by which any student's position was determined is fully auditable. This is not the black box of statistical standardisation; it is a transparent record of collective professional judgement.
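
And "measurable" can be made concrete. One standard index from the comparative judgement literature (not specific to any one platform or report) is the scale separation reliability: the share of the spread in fitted quality estimates that is signal rather than measurement error. The fitted values below are hypothetical.

```python
import statistics


def scale_separation_reliability(qualities, standard_errors):
    """Share of the observed spread in fitted qualities that is signal
    rather than measurement error (analogous to Rasch separation
    reliability). Well-run judging sessions typically report around 0.9."""
    observed_var = statistics.pvariance(qualities)
    error_var = statistics.fmean([se ** 2 for se in standard_errors])
    return max(0.0, (observed_var - error_var) / observed_var)


# hypothetical fitted qualities and standard errors for six scripts
print(round(scale_separation_reliability(
    qualities=[1.8, 1.1, 0.4, -0.2, -0.9, -2.2],
    standard_errors=[0.35, 0.30, 0.28, 0.29, 0.31, 0.40],
), 2))  # prints 0.94
```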

The Question the Report Asks Without Realising It

In its discussion of teacher assessment calibration, the report's implicit challenge throughout is this:

How much simpler it would all be if teachers regularly judged student work from a national sample of schools — not merely looked at it.

The distinction matters. Every system the report describes has tried to solve calibration through passive exposure: publishing exemplars, running standardisation meetings, circulating mark schemes. The evidence across 38 systems is that this does not work reliably enough. Teachers' absolute standards drift. Grades diverge. Credibility erodes.

Active comparative judging, when a teacher sits down to compare essays from their own school against essays from schools they've never visited and makes a series of rapid "which is better?" decisions, is something categorically different from reading exemplars. It is calibration through decision-making. The research shows that the act of making comparative judgements, repeatedly and across a wide range of work, builds and sustains assessors' professional understanding of quality in a way that passive exposure does not.

This is also, incidentally, one of the most powerful professional development experiences available to a teacher. When you have compared 30 pairs of essays from six different schools and found yourself surprised by the quality in schools you expected less from, your teaching changes. Your feedback changes. Your mental model of what students can produce and what your own students are capable of changes. This is not an add-on to assessment. This is assessment as learning.

Three Levels, One System

The OECD report focuses almost entirely on Assessment of Learning, the summative, certificating function of upper secondary assessment. But the comparative judgement approach, properly implemented, operates productively at three levels simultaneously:

Assessment of Learning - producing reliable, valid, transparent rank orders of complex student work that can underpin certification, without the standardisation failures the report documents.

Assessment for Learning - using comparative judgement formatively during a course to give students and teachers meaningful signals about relative quality without reducing complex work to a grade.

Assessment as Learning - deploying the comparative judgement act itself as the learning experience. When students compare and evaluate each other's work, they develop a far more sophisticated understanding of quality criteria than they acquire through direct instruction. When teachers judge work from other schools, they develop the calibration that every system in the OECD report has struggled to build through other means.

These three levels are not separate products or initiatives. They are the natural consequence of treating comparative judgement not as a niche assessment technique but as the cognitive foundation of a system.

What This Means for Policy

The OECD report's Future Work agenda calls for research on marking, moderation and grading systems, and specifically on social moderation structures and local-national feedback loops that support confidence in teacher-given grades.

This is the right question. We believe the answer is comparative judgement at scale: a national infrastructure in which teachers regularly judge student work from outside their own institution, not to produce a grade but to produce a calibrated rank order that anchors both summative certification and ongoing professional development.

This is not a speculative proposition. The technology exists. The psychometric evidence is established. The only thing missing is the willingness, at system level, to question whether absolute judgement is the right cognitive foundation for assessment or whether it is simply the one we inherited.

The OECD report has mapped the territory of the problem with exceptional care. The standardisation challenge is real, it is persistent, and it has defeated every country that has tried to solve it from within the existing paradigm.

The solution is not a better mark scheme. It is a better question.

This post responds to: OECD (2026), The Theory and Practice of Upper Secondary Certification, OECD Publishing, Paris. https://doi.org/10.1787/b3fea5ba-en