When the Candidates Pass – But the Exam Doesn't: How to Rescue Qualifications in the Age of AI
A City of London awarding organisation recently found itself in a strange position. Year after year, candidates were passing a respected Level 6/7 qualification. The statistics looked healthy, the quality assurance paperwork was in order and the exam board could point to detailed rubrics and grade descriptors.
Yet employers were telling a different story.
Newly qualified staff, they said, were struggling with the very situations the qualification was supposed to prepare them for. They could quote rules and repeat frameworks, but when regulation met messy reality, too many of them froze. As one manager put it, “They’ve clearly revised hard, but that’s not the same as being able to think.”
On closer inspection, the problem turned out not to be weak teaching or lenient marking. The problem was that the assessment had stopped measuring what mattered.
A familiar assessment, an unfamiliar gap
The qualification was designed for advanced practitioners in UK financial services. These are people who must interpret complex regulatory requirements, weigh competing obligations and make defensible judgements in ambiguous situations.
The assessment followed a very familiar pattern. Knowledge and basic understanding were tested through multiple‑choice and short‑answer questions. Longer responses, where candidates wrote about case scenarios in more depth, were marked against an analytic rubric and a set of grade descriptors for Pass, Merit and Distinction.
On paper, everything looked respectable. Reliability statistics were fine. The examination board could show that markers were trained and standardised. Scripts were sampled, marks were checked, grade boundaries were reviewed. Nothing in the routine quality processes obviously screamed “crisis”.
But when we sat down with a stack of longer responses, something odd emerged. For questions that were supposed to invite individual judgement and real‑world application, the answers looked remarkably similar. Paragraphs followed the same pattern; examples reappeared with only minor tweaks. It was as if the exam had become a memory test for model answers.
This is what happens when detailed rubrics meet high‑stakes contexts. Over time, teaching and preparation bend towards the rubric, and candidates learn to reproduce what the document rewards. Instead of revealing how people think, the assessment was increasingly revealing how well they had learned to imitate a script.
Why AI raises the stakes
In another era, you might have filed this under “slightly too much coaching” and moved on. Generative AI changes the calculation. Large language models are already good at producing plausible, rubric‑friendly prose to predictable prompts. If a question can be answered by covering an advertised list of points in a standard structure, it is the kind of task AI can help with very effectively.
At the same time, employers are becoming wary. They know that candidates can lean on AI tools when drafting applications and preparing for tests. They are looking for assessments that give them confidence about what a person can really do, particularly when technology is not there to whisper suggestions.
In that context, a qualification that mainly rewards template answers is vulnerable from both directions. It under‑measures the nuanced human judgement that is hardest to fake, and it over‑rewards the kind of surface performance that AI can mimic. The question of construct validity – “are we actually assessing the thing we claim to assess?” – suddenly becomes urgent.
A different kind of question
To understand what was really going on, the awarding organisation agreed to pilot Adaptive Comparative Judgement (ACJ) for the long‑form questions. Instead of asking markers to work their way through a rubric, ticking off elements and assigning marks, ACJ asks them to compare two scripts side by side and decide which one is better, given a clear holistic question. In this case, the question was:
“Which candidate demonstrates a better understanding of regulation and compliance in UK financial services?”
It is a deceptively simple prompt. It forces judges to pay attention to the whole performance: not just whether key points are present, but whether the argument hangs together, whether regulatory considerations are weighed sensibly, and whether theory is genuinely integrated with practice.
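For the technically curious, the mechanics can be sketched in a few lines. CJ tools are commonly described as fitting a Bradley-Terry‑style model, in which each script gets a single quality parameter and each judgement is treated as a noisy comparison between two of them. The toy implementation below illustrates that idea only; it is not RM Compare's actual algorithm, and it leaves out the adaptive pair selection that puts the "A" in ACJ.

```python
# Illustrative only: a toy Bradley-Terry fit over pairwise judgements.
import numpy as np

def fit_scale(judgements, n_scripts, n_iters=500, lr=0.1):
    """Estimate one quality parameter per script from (winner, loser) pairs."""
    theta = np.zeros(n_scripts)
    for _ in range(n_iters):
        grad = np.zeros(n_scripts)
        for winner, loser in judgements:
            # Modelled probability that the winner beats the loser
            p = 1.0 / (1.0 + np.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p   # pull the winner up the scale
            grad[loser] -= 1.0 - p    # and the loser down
        theta += lr * grad            # gradient ascent on the log-likelihood
        theta -= theta.mean()         # centre the scale; its origin is arbitrary
    return theta

# Three scripts, three judgements: script 0 beat 1, 1 beat 2, 0 beat 2
print(fit_scale([(0, 1), (1, 2), (0, 2)], n_scripts=3))
```

In a live session, the adaptive element chooses which pair each judge sees next, concentrating comparisons where the scale is least certain.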
To connect this new approach with the old system, we seeded into the CJ session a set of scripts that had already been marked using the original rubric. Their grades were known, but hidden from the judges. The idea was to see where these “known quantities” would land on the CJ quality scale.
What the scale revealed
Once enough pairwise judgements had been made, the RM Compare system produced a scale that placed every script along a single continuum of perceived quality. Reliability was strong. When we highlighted the seeded scripts on this scale, an interesting pattern emerged.
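A brief aside on that word "reliability". In CJ studies it is usually reported as the Scale Separation Reliability (SSR): the proportion of variance in the fitted scale values that is not attributable to estimation error. A minimal sketch, with invented figures:

```python
import numpy as np

def scale_separation_reliability(theta_hat, standard_errors):
    # Share of observed scale variance that survives once average
    # estimation error is subtracted out (analogous to Rasch
    # person-separation reliability).
    observed_var = np.var(theta_hat, ddof=1)
    error_var = np.mean(np.square(standard_errors))
    return (observed_var - error_var) / observed_var

# Invented scale estimates and standard errors for six scripts
theta_hat = np.array([1.8, 1.1, 0.4, -0.2, -0.9, -2.2])
se = np.array([0.35, 0.30, 0.28, 0.30, 0.33, 0.40])
print(round(scale_separation_reliability(theta_hat, se), 2))  # 0.95 with these numbers
```

By convention, values above roughly 0.8 are read as strong.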
The first observation was reassuring. Scripts previously awarded Distinction mostly appeared towards the top of the CJ scale; those previously judged as Fail tended to sit lower down. The basic ordering of quality was not wildly different. Markers were not randomly inventing grades.
The second observation was more troubling. The spacing between those scripts told a very different story. Some Distinction and Merit pieces were almost touching on the scale, suggesting that the rubric had exaggerated their differences. In other parts of the distribution there were large gaps where the rubric had treated quite distinct performances as essentially the same.
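This spacing analysis is easy to reproduce once every script carries a scale value. The sketch below uses invented numbers, but it shows the shape of the finding: adjacent scripts from different rubric grades can sit almost on top of each other, while scripts sharing a rubric grade can be far apart.

```python
# Invented seeded scripts, sorted top-down by CJ scale value:
# (script id, rubric grade, scale value)
seeded = [
    ("s1", "Distinction", 1.45),
    ("s2", "Merit",       1.41),   # almost touching the Distinction above it
    ("s3", "Merit",       0.20),   # a large gap within the same rubric grade
    ("s4", "Pass",        0.15),
    ("s5", "Pass",       -1.10),
]

for (id_a, grade_a, t_a), (id_b, grade_b, t_b) in zip(seeded, seeded[1:]):
    relation = "across grades" if grade_a != grade_b else "within one grade"
    print(f"{id_a} -> {id_b}: gap {t_a - t_b:.2f} ({relation})")
```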
Finally, when the exam board used their existing grade descriptors to place Pass, Merit and Distinction boundaries on the CJ scale, those cuts did not sit neatly between clusters of scripts. Some responses that had earned a comfortable Merit under the rubric fell clearly below the new Merit line. A few that had been agonised over as "borderline" now looked securely above the Pass threshold.
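Mechanically, placing boundaries amounts to choosing cut points on the scale and checking where the seeded scripts land relative to them. Another sketch, with entirely hypothetical cut values and scripts:

```python
# Entirely hypothetical cut points on the CJ scale; the board would set
# these against their grade descriptors.
BOUNDARIES = [("Distinction", 1.2), ("Merit", 0.3), ("Pass", -0.6)]

def cj_grade(theta):
    for grade, lower_cut in BOUNDARIES:   # highest grade first
        if theta >= lower_cut:
            return grade
    return "Fail"

# Seeded scripts: (id, CJ scale value, grade originally awarded by rubric)
seeded = [("script_A", 0.05, "Merit"), ("script_B", -0.55, "Pass")]
for script_id, theta, rubric_grade in seeded:
    verdict = cj_grade(theta)
    flag = "" if verdict == rubric_grade else "  <- rubric and CJ disagree"
    print(f"{script_id}: rubric={rubric_grade}, CJ={verdict}{flag}")
```

The flagged cases are precisely the "comfortable Merit" scripts described above: pieces whose rubric grade and scale position tell different stories.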
The conclusion was hard to avoid. The problem was not rogue markers or rogue candidates. The problem was that the rubric had pushed everyone to value a narrowed version of the construct. It rewarded visible checklist features more than the deeper understanding and judgement that employers cared about.
Moving to a hybrid model
The response was not to throw out traditional assessments entirely. Knowledge of regulation still matters, and there are plenty of things that multiple‑choice and short‑answer questions can test very efficiently. Instead, the awarding organisation chose to redesign the balance.
The core knowledge component of the qualification will continue to use rubric‑based marking for questions with clear, constrained answers. That part of the exam is about breadth, accuracy and coverage. It is also relatively resistant to the worst abuses of AI, particularly when items are refreshed regularly and delivery is secure.
The longer, scenario‑based questions are changing more radically. These will now be assessed using Comparative Judgement against holistic prompts that are anchored in the capabilities employers actually want to see. The same kinds of professionals who previously worked with rubrics will now compare whole performances, deciding which candidate demonstrates stronger, more convincing regulatory understanding.
Standards will be grounded not just in wording on a page, but in exemplars taken from real scripts at key points on the scale. Those pieces of work will give life to labels such as “Pass” and “Distinction” in a way that is both more concrete for centres and more recognisable for employers.
The original rubric language is not being abandoned; instead, it is being repurposed. Rather than driving scores, it will support feedback, helping explain why a script sits where it does on the CJ scale without constraining judges to count only what can be easily listed.
What we learn from this
This case is about one qualification in one sector, but the pattern is familiar. Wherever open‑ended questions are marked with detailed rubrics in a high‑stakes environment, there is a risk that the assessment gradually shifts from “show us how you think” to “show us that you know what we want”. Add AI into the picture and that drift becomes more dangerous.
Comparative Judgement does not solve every problem. It does, however, offer a powerful way to bring human professional judgement back to the centre of the process and to shine a light on where existing instruments have drifted away from their intended construct.
For the awarding organisation in question, the pilot confirmed what employers had been hinting at for some time: candidates were passing, but the exam was not. Redesigning the assessment so that knowledge is tested efficiently and judgement is assessed holistically is their way of closing that gap.
If you recognise the same tensions in your own assessments – converging answers, expertly coached candidates, uneasy employers – it may be time to ask what your exam is really measuring, and whether a different kind of judgement might tell a truer story.