When AI Beats Economists – And Why That’s Good News For Assessment

Not so long ago, the idea that an AI system could out‑analyse a room full of economists would have sounded like science fiction. Yet that’s exactly what a recent Federal Reserve working paper set out to test.

Economist Serafin Grundl gave three agentic AI systems and 146 human research teams the same job: answer a deceptively simple question about the impact of the DACA immigration programme on full‑time employment. Everyone – human and machine – received the same data, the same instructions, and a sequence of three tasks that gradually narrowed their freedom to choose methods and definitions. In the first task, teams had almost complete freedom to design their own study. In the second, they were given a prescribed research design but retained plenty of analytic choices. In the third, they were handed a pre‑cleaned dataset and a tightly specified framework. At each stage, the goal was the same: estimate the causal effect of DACA eligibility on working full time, and justify the choices that led to that estimate.

The bold twist came at the end. Instead of sending all the submissions to a human panel, Grundl asked other AI models to act as reviewers. These reviewers read the code and reports, followed a detailed template, and then ranked each group of four submissions – three from AI agents, one from a human team – from best to worst. The outcome is arresting. On average, across every task and every reviewer model, the AI submissions came out ahead of the humans. Codex/GPT‑5.4 tended to top the table, followed by another Codex variant and Claude‑based agents, with human researchers consistently ranked last.

At first glance, that sounds like the familiar “AI beats humans” story – a simple tale of replacement. Look a little closer, though, and a different picture emerges, one that speaks directly to the future of marking and to the role of an AI Validation Layer.

The Real Story: Variance Everywhere

What matters is not just who “wins”, but how spread out the answers are. Grundl’s analysis shows that while the average AI estimate sits close to the average human estimate, the distribution of human results has much fatter tails. Human teams, working independently, produce a surprisingly wide range of answers – some much higher, some much lower – all supposedly answering the same question with the same data. AI systems are not immune to this either. Run the same model a hundred times and you do not get a single “true” number; you get a distribution, sometimes even switching sign.

In other words, neither one expert nor one model is enough. The deeper story in this paper is about variance: how much our answers can drift, even in a tightly controlled setting, simply because analysis is a long chain of judgement calls. That will sound familiar to anyone who has worked with assessment. Give two markers the same script and you can get two defensible but meaningfully different marks. Give one large language model the same answer twice and you may get two slightly different scores; change the prompt, the sampling settings or the model version and the differences can grow. The study shows that the “garden of forking paths” is not just a human problem. AI introduces new kinds of variability – sampling randomness, prompt sensitivity, and model evolution over time. It does not eliminate judgement; it multiplies it.

For any system that wants to use AI for marking, that is the central challenge. The question is no longer whether AI can mark like an expert – it clearly can, under the right conditions. The question is how we manage the variability of both human and AI judgements so that the results we rely on are stable, fair and trustworthy.
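To make that concrete, here is a minimal sketch – using invented numbers, not figures from the study – of what treating each mark as one draw from a distribution looks like in practice: summarise repeated marking passes by their spread, not by any single value.

```python
# A minimal sketch with hypothetical scores (out of 25) for a single script.
# The point: repeated AI passes and independent human markers both form a
# distribution, and the spread is as informative as the average.
import statistics

ai_passes = [17, 18, 16, 17, 19, 17, 18, 16, 17, 18]   # same model, re-run
human_marks = [15, 20, 17, 22, 14, 18]                  # independent markers

for label, scores in [("AI (repeated runs)", ai_passes), ("Human markers", human_marks)]:
    print(f"{label:>20}: mean={statistics.mean(scores):.1f}, "
          f"sd={statistics.stdev(scores):.1f}, "
          f"range={min(scores)}-{max(scores)}")
```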

From One Marker to Many: Why AI Marking Needs a Validation Layer

Traditional marking systems often rely on a single marker or a small number of markers per script, with some sampling for moderation. Early AI marking attempts sometimes followed the same pattern: plug in a model, run it once per response, and treat the output as ground truth. Grundl’s study suggests a different mindset. Instead of treating any single judgement – human or machine – as final, we should treat it as one draw from a distribution. On that view, a single AI mark is a powerful but noisy signal, a single human mark is also a powerful but noisy signal, and the real reliability comes from how those signals are combined and checked over time.

This is exactly where the idea of an AI Validation Layer comes in. An AI Validation Layer does not try to hide AI behind a curtain or pretend that discrepancies never happen. Instead, it anchors AI against a robust human consensus on quality, rather than against any one marker. It continuously re‑checks how AI judgements line up with that consensus as models, prompts and contexts change. And it surfaces where AI and human judgements part company, so that those differences can be understood, governed and, where necessary, constrained.

The Grundl paper gives this idea a powerful external example. It shows that AI can reach and sometimes surpass expert‑level performance on complex, open‑ended tasks. It shows that both humans and AIs exhibit substantial dispersion across instances; there is no single, stable oracle. And it shows that asking AIs to act as reviewers – not just scorers – mirrors the kind of comparative processes we already know are more reliable than isolated scores. The right response to “AI is now good enough to mark” is not to let it run unsupervised, but to wrap it in a validation process that is worthy of the stakes.

Judgement as Comparison, Not Just a Number

There is another aspect of the paper that should catch the eye of anyone interested in assessment and AI. The reviewing AIs are not simply emitting numbers. They are comparing, side by side, multiple complete pieces of work. They are being asked to reason about design choices, assumptions, robustness checks and weaknesses, and then to decide which analysis is best overall in that small group. The rankings emerge from relative comparisons, not from absolute scores conjured in isolation.

That is very close to the logic of Adaptive Comparative Judgement. When RM Compare presents human judges with pairs of student essays, portfolios or performances, it does not ask for a finely calibrated mark. It asks a simpler, more natural question: which of these two is better overall? From many such comparisons, a highly stable scale emerges. The Grundl study suggests that AI may be more reliable in precisely this mode: as a comparative judge operating within a structured process, rather than as a standalone scoring engine. Instead of asking an AI to assign 17 out of 25 to a single script, we can ask it which of two or four scripts is stronger, and why. Those comparative judgements from AI and from humans can then be aggregated statistically into robust parameter values and rank orders.
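The aggregation step itself is standard statistics. The sketch below fits a simple Bradley‑Terry model to a handful of hypothetical pairwise judgements; RM Compare’s own model may differ in detail, but the principle – many “which is better?” decisions turning into one rank order – is the same.

```python
# A toy Bradley-Terry fit (Zermelo/MM updates) over hypothetical judgements.
# Each tuple means "the first script was judged better than the second";
# the judges behind those decisions could be human, AI, or a mixture.
from collections import defaultdict
from itertools import chain
import math

judgements = [
    ("A", "B"), ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "A"), ("B", "C"), ("C", "D"), ("D", "B"),
]

items = sorted(set(chain.from_iterable(judgements)))
wins = defaultdict(int)          # total wins per script
pair_counts = defaultdict(int)   # comparisons per unordered pair
for winner, loser in judgements:
    wins[winner] += 1
    pair_counts[frozenset((winner, loser))] += 1

strength = {i: 1.0 for i in items}
for _ in range(200):             # iterate the MM updates to convergence
    new = {}
    for i in items:
        denom = sum(
            pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
            for j in items
            if j != i and frozenset((i, j)) in pair_counts
        )
        new[i] = wins[i] / denom
    scale = len(items) / sum(new.values())   # fix the arbitrary scale
    strength = {i: v * scale for i, v in new.items()}

for i in sorted(items, key=strength.get, reverse=True):
    print(f"{i}: strength={strength[i]:.2f}, logit={math.log(strength[i]):.2f}")
```

Whichever mix of human and AI judges supplies the pairwise decisions, they feed the same model – which is what makes it possible to place human and AI judgement on a common scale.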

Seen this way, AI becomes one more powerful judge in a larger system of judgement – not a replacement for that system.

What This Means for AI Marking in Practice

All of this has concrete implications for how we design AI‑enabled marking. The first is that we can afford to be ambitious. If AI can match PhD economists on open‑ended causal inference tasks, then it can certainly help evaluate student work, professional portfolios or complex written responses, provided the workflows and prompts are carefully designed. There is no need to confine AI to multiple‑choice questions or ultra‑constrained formats.

The second is that we should be sceptical of any approach that treats a single AI pass as definitive. Variability is a feature, not a bug. Well‑designed marking systems will combine multiple AI signals – whether from repeated runs, different models, or different prompt profiles – and will combine those signals with human judgement, especially in high‑stakes or ambiguous cases. They will monitor patterns over time so that drift and bias are detected early, rather than after students have been affected.
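One simple pattern for that second point, sketched with hypothetical marks and an arbitrary threshold: combine several AI passes robustly, and route the scripts where those passes disagree to human moderation.

```python
# A minimal sketch: combine several AI marking passes per script and flag
# large disagreement for human review. Marks, the three "passes" and the
# threshold are hypothetical illustrations, not recommended settings.
import statistics

ai_marks = {                       # e.g. two models plus a repeated run
    "script_01": [17, 18, 17],
    "script_02": [12, 21, 14],     # the passes disagree sharply
    "script_03": [23, 22, 23],
}
REVIEW_THRESHOLD = 3               # spread (max - min) that triggers moderation

for script, marks in ai_marks.items():
    combined = statistics.median(marks)   # robust to one stray pass
    spread = max(marks) - min(marks)
    verdict = "route to human moderation" if spread >= REVIEW_THRESHOLD else "accept"
    print(f"{script}: combined={combined}, spread={spread}, {verdict}")
```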

The third is that comparative approaches deserve to sit at the centre of this new landscape. Whether the judges are human, AI, or a mixture of both, asking for comparative judgements and then aggregating them makes better use of noisy individual opinions than asking for isolated scores and hoping they line up.

RM Compare and the AI Validation Layer come together at exactly this point. RM Compare provides the infrastructure for large‑scale comparative judgement. The AI Validation Layer uses that infrastructure to establish human‑grounded standards against which AI marking can be calibrated, to test and retest AI models as they evolve, and to provide transparent evidence when institutions need to answer the question of who is assessing the AI that is assessing their students.
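A recurring calibration check of that kind can be as simple as asking how closely each model version’s rank order of scripts tracks the human consensus rank order. The sketch below uses invented scale values and a plain Spearman correlation; in practice the human side would come from comparative‑judgement sessions rather than numbers typed into a script.

```python
# A minimal sketch of a re-run calibration check: does each AI model version
# still rank scripts the way the human consensus scale does? The scale values
# below are invented for illustration.

def spearman(xs, ys):
    """Spearman rank correlation (assumes no tied values, for brevity)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))

human_consensus = [72, 65, 80, 58, 90, 61]   # human-grounded scale values
ai_version_1 = [70, 66, 78, 60, 88, 63]      # earlier model: same rank order
ai_version_2 = [75, 55, 70, 68, 85, 52]      # later model: starting to drift

for label, scores in [("model v1", ai_version_1), ("model v2", ai_version_2)]:
    print(f"{label}: rank agreement with human consensus = "
          f"{spearman(human_consensus, scores):.2f}")
```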

When Everything Starts To Look “Claudey”

There is another way to read Grundl’s results that will feel uncomfortably familiar to anyone watching generative AI seep into every corner of knowledge work. In this experiment, the AI systems and the human economists end up in roughly the same place on average – their means and medians are close – but the humans display much fatter tails. Their estimates spread out further in both directions, while the AI estimates cluster more tightly around the middle.

You could think of that as a picture of a future in which AI is, in effect, tuned to the median. Models are trained, aligned and fine‑tuned to avoid the worst mistakes, to stay within policy, to sound reasonable and to follow instructions. That makes them extremely good at producing competent, defensible, “on‑brief” work. It also nudges them away from the edges – away from the strange, risky, or genuinely heterodox moves that often live out in the tails of human judgement.

Extrapolated out into assessment, research and professional writing, the risk is a kind of creeping sameness. Analyses start to share a family resemblance. Essays converge on a common voice. Mark schemes drift towards what the model thinks typical good work looks like. Without anyone quite deciding it, we slide into a world where everything feels just a bit… Claudey.

Seen through that lens, the wider dispersion of the human economists is not just noise; it is where the human “X factor” shows up. The tails are where you find the unusually insightful approaches as well as the dead ends. If assessment systems lean too heavily on median‑seeking AI, they may quietly compress that space, making it harder to recognise and reward the truly distinctive.

For RM Compare and the AI Validation Layer, this is a reason to keep human judgement at the centre of the story. Comparative judgement does not hard‑code a single notion of “median quality”. It lets a community of judges shape and reshape the standard by continually comparing real work. And the validation layer gives you a way to see, in the data, whether AI‑influenced marking is collapsing everything towards the middle, or whether your system is still sensitive to the exceptional – the pieces of work that do not look Claudey at all.

There is a deeper danger here too. The same tight clustering that makes AI outputs look reassuringly consistent is also what can make them brittle. In Grundl’s study, once the analytical framework is prescribed, the agents build very similar datasets and their estimates pull in close around the centre. That’s ideal when the framework is sound. But if a wrong assumption, a missing mechanism or a subtle bias is baked into that framework, the result is high‑confidence, tightly‑grouped, incorrect answers at scale. Human economists look messier in the graphs because their estimates spill out into the tails, yet that variance is also where the system keeps a capacity for dissent: a correct outlier has more room to appear and challenge the emerging consensus.

The Next Chapter: Human and AI Judgement, Together… perhaps

The Grundl paper is striking because it shows that AI can now match, and sometimes surpass, human experts on specific analytical tasks. But it is just as important for what it says about the landscape those experts and systems now inhabit. There will be more models, more versions, more prompts and more pathways through the same problem than any single person could ever track. The challenge is no longer to pick a single hero – human or machine – and hope for the best. It is to build systems that can harness all that noisy capability and turn it into something coherent, fair and trustworthy.

Adaptive Comparative Judgement was designed to do exactly that for human work. The AI Validation Layer extends the same principle to a mixed world of human and AI outputs. Far from making comparative judgement obsolete, research like Grundl’s suggests it may be one of the approaches that becomes more valuable as AI advances, because it tackles the question that will matter most in AI marking: given all these possible answers, from all these different minds and models, which ones really stand up when we compare them?