Opinion
How reliable are GCSE grades – and what can we do about it?
Every year in England, some things in education feel almost guaranteed. Exam season will arrive on schedule, bringing with it the familiar mix of anxiety, hope and hard work in schools up and down the country. Ofqual will emphasise that our qualifications system is robustly designed and closely regulated, and that it delivers grades that are fair and can be trusted. And Dennis Sherwood will publish fresh analysis arguing – drawing largely on Ofqual’s own technical reports – that neither the level of fairness nor the level of trust we assume is quite what it seems.
From a distance, this rhythm can look almost comforting. Schools do their best for their students, the exam system does its work, and the official message is that outcomes are reliable enough for everyone to move on. Up close, though, Sherwood’s numbers force us to sit with a more awkward question: if grades are as secure as we say they are, why do the data keep suggesting that a significant minority could easily have come out differently?
When you look at the numbers behind that official reassurance, the picture becomes more uncomfortable. Analyses based on Ofqual’s own research suggest that, if every exam script were fairly re-marked by a senior examiner, only around three-quarters of grades would be confirmed, while roughly one in four would be different – about 4.5 million “right” grades and 1.5 million “wrong” grades out of the 6 million awarded each summer. In some subjects the situation is starker still: for GCSE English, for example, estimates of grade reliability cluster around 60%, which implies that close to two in five students may be holding a certificate that does not show the grade a senior examiner would have given their script. This is the evidence that underpins Sherwood’s now-familiar headline claim that “one school exam grade in four is wrong”, and his argument that such levels of uncertainty sit uneasily alongside the way we present grades as precise and definitive judgements.
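For readers who want to sanity-check the arithmetic, here is a minimal sketch using only the round numbers quoted above (not Ofqual’s underlying data):

```python
# Back-of-envelope check of the headline figures. All inputs are the
# round numbers quoted in the article, not Ofqual's raw data.
total_grades = 6_000_000       # approx. grades awarded each summer
overall_reliability = 0.75     # share confirmed on a senior re-mark

confirmed = total_grades * overall_reliability        # ~4.5 million
changed = total_grades * (1 - overall_reliability)    # ~1.5 million
print(f"confirmed: {confirmed:,.0f}  changed: {changed:,.0f}")

# GCSE English: reliability estimates cluster near 60%, so roughly
# 1 - 0.60 = 40% (two in five) of grades could differ on re-mark.
english_reliability = 0.60
print(f"GCSE English grades that might change: {1 - english_reliability:.0%}")
```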
Part of the tension here is between what schools, students and families understandably need from the system, and what the evidence suggests the system can actually deliver. In a high‑stakes world, everyone reaches for certainty: a grade on a certificate that feels final, objective and beyond dispute. At a system level, the regulator also needs an approach that is scalable, operationally efficient and capable of commanding political support year after year. Official messages often reinforce a sense of firmness, talking about grades as “reliable to one grade either way” or “accurate plus or minus one grade” in ways that sound reassuringly precise. The underlying data, though, point to something more unsettling – a system that works hard to be fair, but is inevitably shot through with uncertainty, especially where complex, extended responses are boiled down to a single letter or number. The result is what Sherwood and others describe as an illusion of certainty: a surface story of accuracy and control wrapped around grades that, statistically, are much more fragile than most of us would like to believe.
Sherwood is not suggesting that examiners are careless or that the system is corrupt; he is asking us to confront something more structural. Extended responses in subjects like English and history, and in some vocational qualifications, are complex pieces of work, and well-trained examiners can legitimately differ over exactly which mark they deserve. When those legitimate differences straddle grade boundaries, the final grade can depend on which acceptable mark is chosen on the day. At scale, small variations in many such decisions aggregate into the reliability figures that now attract headlines.
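To see how that plays out, here is a toy simulation. Every number in it is invented – the mark scale, the grade boundaries and the size of legitimate examiner disagreement – but it shows how a few marks of honest variation become grade changes whenever a script sits near a boundary:

```python
import random

# Toy Monte Carlo illustration of the boundary effect described above.
# The 0-80 mark scale, the boundary positions and the examiner spread
# are all invented for illustration.
random.seed(42)
BOUNDARIES = [0, 20, 30, 40, 50, 60, 70]  # lowest mark for each grade

def grade(mark):
    """Index of the highest grade boundary the mark reaches."""
    return max((i for i, b in enumerate(BOUNDARIES) if mark >= b), default=0)

trials, changed = 100_000, 0
for _ in range(trials):
    quality = random.uniform(0, 80)            # the script's "true" quality
    mark_a = quality + random.gauss(0, 2.5)    # examiner A's legitimate mark
    mark_b = quality + random.gauss(0, 2.5)    # examiner B's legitimate mark
    if grade(mark_a) != grade(mark_b):
        changed += 1

print(f"grades that differ between two legitimate markers: {changed/trials:.0%}")
```

With these made-up parameters, the two markers disagree on the final grade for a substantial minority of scripts – in the same ballpark as the headline figures – even though their marks for any individual script typically differ by only a few points.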
What the reliability evidence shows
Ofqual’s own technical reports and independent commentaries converge on a broadly consistent picture.
- Across GCSEs and A levels, around 75% of grades appear to be stable under re-marking by a senior examiner, with around 25% likely to change by at least one grade if re-marked.
- Reliability is higher in more objectively marked subjects, such as mathematics and some sciences, and lower in essay-based and performance subjects.
- Some subject–tier combinations, such as GCSE English, show reliability estimates around 60%, meaning that roughly two in five grades might differ on re-mark.
For students on or near a boundary, that uncertainty is not theoretical. It can affect entry to sixth form or college, access to apprenticeships, and longer-term confidence in their own abilities.
Where methods like Comparative Judgement might fit in
Given these challenges, it is reasonable to ask whether there are ways of capturing professional judgement that can reduce some of the fuzziness in how complex scripts are judged – without pretending to eliminate uncertainty altogether. Comparative Judgement (CJ) is one candidate, but it is not a magic solution and it will never be the right answer for every subject, task or context.
In CJ, examiners compare pairs of student work and decide which is better overall, rather than assigning individual marks against a detailed rubric. A growing body of research – including recent studies using RM Compare on demanding tasks such as long, third‑year law essays – suggests that, when used carefully, Adaptive Comparative Judgement (ACJ) can achieve reliability at least comparable to, and sometimes higher than, traditional marking, while producing grades that broadly align with existing standards.
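For the statistically curious, here is a minimal sketch of the idea underneath: most CJ systems fit a Bradley-Terry-style model that converts many pairwise judgements into a single quality scale. Everything below is illustrative – the scripts, the simulated judges and the simple gradient fit – and real engines such as RM Compare add adaptive pairing and much more besides:

```python
import math
import random

# Minimal Bradley-Terry sketch: pairwise "A is better than B" judgements
# are combined into a quality scale. All data here are simulated.
random.seed(1)
N = 8                                          # number of scripts
true_quality = [random.gauss(0, 1) for _ in range(N)]

def win_prob(qi, qj):
    """Bradley-Terry: P(script i judged better than script j)."""
    return 1 / (1 + math.exp(-(qi - qj)))

# Simulate judges comparing random pairs of scripts.
judgements = []                                # (winner, loser) pairs
for _ in range(500):
    i, j = random.sample(range(N), 2)
    if random.random() < win_prob(true_quality[i], true_quality[j]):
        judgements.append((i, j))
    else:
        judgements.append((j, i))

# Fit quality estimates by gradient ascent on the log-likelihood.
est = [0.0] * N
for _ in range(300):
    grad = [0.0] * N
    for w, l in judgements:
        p = win_prob(est[w], est[l])
        grad[w] += 1 - p                       # winner pulled up
        grad[l] -= 1 - p                       # loser pushed down
    est = [e + 0.01 * g for e, g in zip(est, grad)]

print("true ranking:     ", sorted(range(N), key=lambda k: -true_quality[k]))
print("estimated ranking:", sorted(range(N), key=lambda k: -est[k]))
```

With enough judgements, the estimated ordering closely tracks the underlying quality ordering – which is why CJ can turn many individually fallible comparisons into a stable collective scale.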
At the same time, our own work on the “reliability paradox” reminds us that the future of assessment is likely to be more explicitly nondeterministic, not less. For complex, holistic performances, the goal is not to force human judgement into a perfectly deterministic mould, but to use methods like ACJ to harness that variability and turn it into a stable, transparent consensus about quality.
That is why we talk about a multi‑modal “three mirrors” approach: using deterministic, rubric-based assessment where it is strongest; holistic, comparative approaches like RM Compare where professional synthesis matters most; and authenticity checks to ensure the work is genuinely the learner’s own. In this view, CJ is one part of a broader ecosystem, not a replacement for everything that came before.
At RM, and in the RM Compare team specifically, we see this as ongoing learning rather than a finished story. Each new project and each new piece of research, including the latest work on high-stakes university law assessments, helps refine our understanding of where ACJ adds most value, where conventional marking remains the better fit, and how the two can work together to support fair, trustworthy outcomes for students.
A more honest, more robust path forward
If we want to maintain and deepen trust in GCSEs, we may need to adjust the way we talk about grades and the tools we use to produce them.
- Make reliability visible: publish clear, accessible, subject-level reliability metrics and explain what they mean for students and schools.
- Use the right tools for the right tasks: deploy methods like Comparative Judgement for components where traditional marking is least reliable, either as the primary scoring method or as a validation and moderation layer.
- Design policy around probabilities, not certainties: recognise that every grade is an estimate with a margin of error, and ensure progression rules, safety nets and appeals are robust to that fact.
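On that last point, even the way a grade is reported could carry its uncertainty with it. A toy illustration, with entirely invented probabilities, of what a “grade plus margin” report might look like:

```python
# Toy illustration of a grade reported alongside its uncertainty.
# The awarded grade and the probability distribution are invented.
report = {
    "awarded": 5,
    "senior_examiner_grade_probs": {4: 0.18, 5: 0.64, 6: 0.18},
}

probs = report["senior_examiner_grade_probs"]
within_one = sum(p for g, p in probs.items()
                 if abs(g - report["awarded"]) <= 1)

print(f"Awarded grade: {report['awarded']}")
print(f"P(senior examiner agrees exactly): {probs[report['awarded']]:.0%}")
print(f"P(within one grade either way):    {within_one:.0%}")
```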
The cyclical nature of the school year will not change, and nor will the need for shared, trusted signals about what students know and can do. What can change is how honest we are about the limits of our current system – and how determined we are to use better evidence and better methods to make grades as fair, reliable and meaningful as possible.