- Opinion
Can Foundation Models Really Replace Human Judgement? What GPT-5.2's Launch Today Tells Us, and Why RM Compare Matters More Than Ever
Today, December 11, 2025, OpenAI launched GPT-5.2, its "most advanced model yet for professional work," claiming improvements in coding, reasoning, and long-context analysis, along with fewer hallucinations than any previous version. The benchmarks are impressive: GPT-5.2 Thinking scored 100% on a challenging mathematics test, outperformed human professionals on 70% of well-defined knowledge work tasks across 44 occupations, and reduced factual errors by 30% compared to its predecessor.
If you've been watching the space, you might be wondering: is this it? Have we reached the point where foundation models can truly replace human judgement?
The answer is still no. And understanding why—especially on the day GPT-5.2 launches—is critical for anyone building assessment, hiring, or decision systems that matter.
The seductive promise
The pitch is getting more compelling by the week. GPT-5.2 can draft spreadsheets, interpret complex images, manage multi-step projects, and handle 400,000-token contexts - essentially hundreds of documents at once. Microsoft is already rolling it out across its Copilot suite. OpenAI's own benchmarks show the model matching or exceeding expert human performance on structured professional tasks.
And it's not just OpenAI. Google's Gemini 3, Anthropic's Opus 4.5, and a growing ecosystem of "LLM-as-a-judge" tools are demonstrating that foundation models can approximate human preferences with surprising accuracy in controlled settings. Research shows that leading models often match crowd-labelled judgements, predict modal human responses in vignettes, and can be fine-tuned using preference data to align with human values.
So if models can simulate human preferences, why bother with human panels at all? Why spend time training assessors, running comparative judgment sessions, or convening expert judges when an AI could generate decisions faster, cheaper, and at scale?
What the evidence actually shows, and what it doesn't
Here's the reality behind today's headlines.
GPT-5.2 is impressive. The improvements in reasoning, coding, and factual accuracy are real. But even OpenAI's own framing gives the game away: the model excels at "well-specified knowledge work tasks" and "structured professional applications."
What does "well-specified" mean? It means tasks with clear inputs, defined outputs, and established patterns in training data. It means domains where there's a right answer, or at least a statistically dominant answer.
But human judgement - real human judgement in assessment, hiring, admissions, grants, and other high-stakes decisions - is rarely that neat.
Research on foundation models and human preference modelling shows:
They can reproduce aggregate patterns. Models trained on preference data learn to predict what most people would say in a dataset. A moral judgement model can match crowd-labelled norms over 90% of the time. An affective cognition model can predict modal emotional responses better than individual humans.
But they struggle with variation. Minority views, cultural differences, individual expertise, and context-sensitive judgement - the things that make human assessment judgement rather than just scoring - are exactly where models fall short. A model might reproduce the "safe" choice, while systematically missing the candidate that 20% of expert judges would champion because they see something non-obvious.
They're brittle across contexts. When the task changes - a new type of application, a new population, a new framing - models drift. They pick up surface patterns but miss deeper reasoning. A model trained on "good essays" learns statistical features of good essays, not the why behind expert judgement.
They don't calibrate like humans. Expert human judges improve through experience. They align with each other, they develop connoisseurship, they learn what quality looks like in their domain. Models stay static between training runs. You don't know if they're staying aligned with your actual values or drifting until something goes visibly wrong.
Even GPT-5.2's impressive hallucination reduction - down to 10.8% on factual questions, 5.8% with web access - means roughly 1 in 10 statements may still be wrong. In high-stakes assessment or hiring, that's not acceptable without human oversight.
The gap that remains, even with GPT-5.2
OpenAI launched GPT-5.2 under competitive pressure after Google's Gemini 3 topped several benchmarks and CEO Sam Altman declared an internal "code red." The company has been explicit: they want to "unlock even more economic value" and prove ChatGPT can handle "professional work" at scale.
But here's what they haven't claimed, and can't claim:
- That GPT-5.2 can replace expert human judgement in domains where preferences are heterogeneous, contextual, or evolving.
- That it can substitute for the calibration, dissent, and collective deliberation that make human panels trustworthy.
- That organisations can use it as a sole decision-maker in high-stakes contexts without ongoing human validation.
Recent expert commentary in venues like PNAS explicitly warns against treating LLMs as drop-in replacements for humans in behavioural research or policy analysis. Studies on LLMs in hiring and risk assessment show systematic biases that surface only under scrutiny.
The core problem: models learn to approximate labelled training data, not human values themselves. They can predict what most people said in a dataset, but that's not the same as simulating the richness, adaptability, and contestability of real human judgement.
Why this matters for assessment and meritocracy
If organisations believe GPT-5.2 or its successors can fully simulate human preferences, they face a false choice:
Option A: Hire humans to judge, run comparative sessions, build expertise - expensive and slow.
Option B: Use an AI judge - cheaper, faster, scalable.
If you believe the benchmarks, Option B looks obviously right.
But here's what actually happens when you choose Option B alone:
- You lose visibility into bias. A human panel surfaces disagreement and systematic patterns. An AI judge gives you a score and a rationale. You don't know if it's excluding good candidates for the wrong reasons until something goes visibly wrong.
- You lose the calibration signal. Expert judges improve through experience and align with each other over time. A model stays static. You have no way to know if it's aligned with your actual values or drifting.
- You inherit the model's blindnesses. If the foundation model's training data underrepresented certain groups or undervalued certain kinds of excellence, the model embeds that pattern, and because it reproduces "majority" preferences, it feels natural even when it's systematically unfair.
- You can't explain what went wrong. When a human panel makes a questionable decision, you can ask why and surface the reasoning. When an AI judge does, you get a confidence score and a generated explanation that may not reflect how the decision was actually made.
For meritocracy, this is existential. If assessment systems can't see their own biases, can't calibrate, and can't surface dissent, "merit" becomes whatever the model thinks merit is, not what your organisation actually values.
The path forward: calibration, not replacement
The emerging consensus, even as models like GPT-5.2 get dramatically better, is clear: foundation models as judges need a human anchor.
Organisations thinking carefully about this are moving toward hybrid models:
- Use AI for draft judgement or initial screening, where speed and scale matter.
- Run comparative judgement sessions with human experts to generate reliable reference data and catch divergence.
- Use those human comparative datasets to continuously calibrate and audit the AI judges (a minimal sketch of this audit step follows the list).
- Make the relationship explicit: the AI amplifies human expertise; it doesn't replace it.
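To make the audit step concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the submission IDs, the human ranking, the AI judge's scores, and the agreement threshold are illustrative placeholders, and this is not an RM Compare or OpenAI API. It simply checks how closely an AI judge's ordering tracks a human comparative-judgement ranking before you rely on it.

```python
# Minimal sketch: check whether an AI judge tracks a human comparative-judgement
# ranking before relying on it. All data below is hypothetical; this is not an
# RM Compare or OpenAI API, just a plain aggregate-agreement check.
from scipy.stats import spearmanr

# Human ranking from a comparative judgement session (rank 1 = best).
human_rank = {"s1": 1, "s2": 2, "s3": 3, "s4": 4, "s5": 5, "s6": 6}
# Scores produced by an AI judge for the same submissions (higher = better).
ai_score = {"s1": 0.92, "s2": 0.55, "s3": 0.81, "s4": 0.60, "s5": 0.35, "s6": 0.47}

ids = sorted(human_rank)
# Negate the AI scores so that, like ranks, smaller means better.
rho, _ = spearmanr([human_rank[i] for i in ids], [-ai_score[i] for i in ids])

THRESHOLD = 0.8  # illustrative cut-off; set it from your own pilot sessions
if rho < THRESHOLD:
    print(f"AI judge diverges from the human benchmark (rho = {rho:.2f}): keep humans in the loop")
else:
    print(f"AI judge broadly tracks the human benchmark (rho = {rho:.2f}): still re-audit periodically")
```

Kendall's tau, or agreement on the top decile, would do the same job; the design point is that the human comparative data, not the model's own confidence, is the reference.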
This isn't "AI versus humans." It's "AI plus humans, with humans staying in control."
Research on "LLM-as-a-judge done right" confirms this approach. Models work best when paired with structured human feedback, when their outputs are subject to comparative audit, and when organisations maintain a clear, testable theory of what good judgement looks like in their domain.
Critically, organisations that do this well use comparative judgement frameworks to make human expertise visible, auditable, and systematic. They don't just have a panel of judges; they have a panel whose collective wisdom is captured in comparative data, stored, and used to train, monitor, and improve the AI systems that run at scale.
This is where RM Compare enters the story
The question you should be asking - whether you're running hiring, admissions, grants, skills assessment, or any other judgement-intensive process - is not "should we use GPT-5.2 or human judges?" It's "how do we build and maintain human-grounded expertise that keeps any AI we use honest?"
RM Compare generates that human-grounded expertise at scale. Adaptive Comparative Judgement does three things simultaneously:
- It produces reliable human judgements. Not one person's opinion, but a calibrated, statistically robust signal of collective expert judgement (a minimal sketch of the underlying pairwise mechanism follows this list).
- It makes expert judgement visible and transferable. The comparative data becomes a lasting record of what your best judges actually valued, where they agreed, and where they diverged.
- It creates the reference standard you need to audit AI. If you want to know whether GPT-5.2 or any other foundation model is staying aligned with your actual human expertise, you need a trusted human benchmark. RM Compare is that benchmark.
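To show the pairwise mechanism concretely, here is a minimal sketch: judges make many "which of these two is better?" decisions, and a statistical model turns those decisions into a single ranked scale. The sketch uses a plain Bradley-Terry fit as a stand-in; RM Compare's actual Adaptive Comparative Judgement engine is more sophisticated and adaptive, and the judgement data below is hypothetical.

```python
# Minimal sketch: turn pairwise "which is better?" judgements into a ranked scale
# with a simple Bradley-Terry fit. Illustrative only; this is not RM Compare's
# ACJ engine, and the (winner, loser) data is hypothetical.
from collections import defaultdict

# (winner, loser) pairs recorded from expert judges comparing submissions A-D.
judgements = [("A", "B"), ("A", "B"), ("B", "A"),
              ("A", "C"), ("B", "C"), ("C", "B"),
              ("B", "D"), ("C", "D"), ("D", "C"),
              ("A", "D")]

items = sorted({x for pair in judgements for x in pair})
wins = defaultdict(int)    # total wins per submission
pairs = defaultdict(int)   # number of comparisons per unordered pair
for winner, loser in judgements:
    wins[winner] += 1
    pairs[frozenset((winner, loser))] += 1

# Classic minorisation-maximisation updates for Bradley-Terry strengths.
strength = {i: 1.0 for i in items}
for _ in range(200):
    new = {}
    for i in items:
        denom = sum(pairs[frozenset((i, j))] / (strength[i] + strength[j])
                    for j in items if j != i and frozenset((i, j)) in pairs)
        new[i] = wins[i] / denom if denom else strength[i]
    total = sum(new.values())
    strength = {i: v / total for i, v in new.items()}  # normalise each pass

# Print the fitted scale, best first.
for item in sorted(items, key=strength.get, reverse=True):
    print(f"{item}: {strength[item]:.3f}")
```

Every submission ends up on one shared scale rather than in isolated scores, which is what makes the human signal reusable as a benchmark that an AI judge can later be checked against.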
Practically, this means:
- If you're considering using GPT-5.2 to screen applications or rank submissions, run a parallel RM Compare session first. You'll get a reliable human ranking and rich data on why expert judges valued certain work.
- Use that ranking and dataset to calibrate your AI judge. Test it. See where it diverges.
- Run periodic RM Compare sessions as your domain, market, or priorities change. Let human expertise refresh and re-calibrate the AI.
- When something goes wrong - when you realise the AI is consistently undervaluing certain candidates - you have the human data to diagnose and fix it (see the sketch below).
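As a minimal sketch of that diagnostic step, complementing the aggregate agreement check sketched earlier, you can compare the two rankings item by item and surface the submissions the AI judge pushes furthest down relative to the human panel. The data is hypothetical, and this is not an RM Compare feature, just one plausible way to use the human benchmark.

```python
# Minimal sketch: find which submissions an AI judge undervalues relative to the
# human benchmark ranking. Hypothetical data; not an RM Compare or OpenAI API.

# Rank 1 = best. The human ranking comes from a comparative judgement session;
# the AI ranking comes from sorting the AI judge's scores.
human_rank = {"s1": 1, "s2": 2, "s3": 3, "s4": 4, "s5": 5, "s6": 6}
ai_rank    = {"s1": 1, "s2": 5, "s3": 2, "s4": 3, "s5": 6, "s6": 4}

# Positive gap = the AI places the submission lower (worse) than the human panel did.
gaps = {sid: ai_rank[sid] - human_rank[sid] for sid in human_rank}

for sid, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    if gap > 0:
        print(f"{sid}: pushed down {gap} place(s) by the AI judge; review its rationale")
```

Joining those gaps back to the judges' original comparative decisions is what lets you ask why the panel rated those submissions higher, and whether the AI judge is missing something your organisation actually values.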
The honest conversation
GPT-5.2 represents genuine progress. The improvements in reasoning, accuracy, and professional task performance are real and valuable.
But the gap between "can approximate majority preferences in well-specified tasks" and "can replace human judgement in high-stakes assessment" hasn't closed. If anything, as models get better at surface-level performance, the risk of over-trusting them without validation increases.
The organisations that will succeed in the next decade are not the ones who bet entirely on AI judgement, nor the ones who reject it. They're the ones who ask: How do we use AI to amplify human expertise while keeping human expertise at the centre?
That question has a practical answer. And on the day GPT-5.2 launches, RM Compare's role in that answer is clearer than ever.
The models can simulate preferences. But organisations that care about real merit, fairness, and trust need more than simulation. They need calibration, visibility, and the ability to say "this is what we actually value", and mean it.