When the Black Box Gets Good

For a long time, the debate about AI in assessment focused on capability. Could AI produce judgements that were reliable enough to take seriously? Could it handle open-ended responses, complex performances, or evidence of learning that did not fit neatly into a mark scheme?

That debate is changing.

Across education and assessment, newer AI systems are starting to show that, on some tasks, they can produce results close to those of experienced human judges. That is an important shift. It means the conversation is no longer only about whether AI can participate in assessment. It is increasingly about what kind of assessment system we build around it.

That distinction matters.

Once the black box starts to look capable, the hardest questions are no longer technical in the narrow sense. They become questions of trust, governance and professional judgement. How do we know when an AI-supported assessment process is behaving well? How do we spot drift? How do we keep standards stable over time? And how do we make sure that the use of AI strengthens assessment rather than hollowing it out?

Those are the questions that matter now.

Capability is only the start

If an AI system can produce scores, rankings, feedback or classifications that broadly align with human judgements, that is a meaningful achievement. It opens the door to faster processes, greater scalability, and new kinds of assessment support.

But capability alone does not create confidence.

Assessment is not just a prediction problem. It is a trust problem. Institutions need to know not only that a system worked once in a pilot, but that it continues to behave appropriately as tasks, cohorts, policies and models evolve. They need to understand where disagreements appear, what counts as acceptable variation, and when human review should take over.

In other words, the real challenge is not simply making AI assessment work. It is making AI assessment governable.

Why the black box still matters

There is a temptation to think that if outcomes look good enough, the black-box problem has gone away. In practice, the opposite is often true.

The better AI systems become, the more likely they are to be used in consequential settings. And the more consequential the setting, the more important it becomes to have a defensible way of checking whether those systems remain aligned with expert human standards.

You do not necessarily need to see inside every model to use AI responsibly in assessment. But you do need a reliable way to evaluate what comes out, to compare it against trusted human judgement, and to keep doing that as the context changes.

That is where validation becomes central.

Human judgement as the reference point

In an AI-rich assessment landscape, human judgement should not disappear from view. Its role changes.

Human judgement becomes the reference point that gives assessment systems legitimacy. It is how institutions define what quality looks like, how they test automated outputs against professional expectations, and how they maintain confidence when those automated systems change.

This is one reason comparative judgement matters so much in the age of AI. It provides a practical, scalable way to establish and refresh human standards on complex work. It does not force every decision into a rigid mark scheme, and it does not ask institutions to trust automation without a benchmark. Instead, it creates a human-grounded standard that can be used to evaluate whatever AI tools sit alongside it.

Why validation layers matter more now

As AI becomes more capable, many organisations will not want a single monolithic assessment tool. They will want combinations of models, agents, workflows and platforms that can adapt over time.

That makes validation layers more important, not less.

A validation layer gives organisations a stable way to test AI-supported assessment against trusted human standards. It creates a place where outputs can be benchmarked, where disagreements can be surfaced, where changes can be monitored, and where confidence can be built with evidence rather than assumption.
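To make that concrete, here is a minimal sketch of what one such check might look like. It assumes a set of scripts that already have human benchmark scores (for example, from a comparative judgement session) and AI scores for the same scripts. The function names, the `max_rank_gap` threshold and the choice of Spearman rank correlation as the agreement measure are illustrative assumptions, not a description of how any particular product works.

```python
from statistics import mean


def ranks(scores):
    """Rank items from 1 (lowest score) upward, averaging the ranks of ties."""
    ordered = sorted(scores, key=scores.get)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j across any run of items that share the same score.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[ordered[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / (var_x * var_y) ** 0.5


def validate(human, ai, max_rank_gap=2):
    """Compare AI scores with a human benchmark.

    Returns the Spearman rank correlation (Pearson correlation of the ranks)
    and the items whose AI rank differs from the human rank by more than
    max_rank_gap. The threshold is a hypothetical policy choice.
    """
    items = sorted(human)
    human_ranks = ranks(human)
    ai_ranks = ranks({item: ai[item] for item in items})
    rho = pearson([human_ranks[i] for i in items], [ai_ranks[i] for i in items])
    flagged = [i for i in items if abs(human_ranks[i] - ai_ranks[i]) > max_rank_gap]
    return rho, flagged


# Made-up scores for five scripts, purely for illustration.
human_benchmark = {"s1": 0.9, "s2": 0.4, "s3": 0.7, "s4": 0.1, "s5": 0.55}
ai_scores = {"s1": 0.85, "s2": 0.5, "s3": 0.2, "s4": 0.15, "s5": 0.6}

agreement, for_review = validate(human_benchmark, ai_scores, max_rank_gap=1)
print(f"rank agreement: {agreement:.2f}; send to human review: {for_review}")
```

The point of the sketch is the shape of the check rather than the specific statistic: a single agreement figure shows how closely the AI tracks the benchmark overall, while the flagged list surfaces the individual disagreements that deserve a human look.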

This is especially important in assessment, because the question is not just whether an output was produced efficiently. The question is whether it supports valid inferences about learning, progression and quality.

RM Compare’s role

This is the role RM Compare is built to play.

RM Compare is not simply a tool for producing results. It is assessment infrastructure: a way to create, share and apply standards through human comparative judgement, and a way to keep those standards visible as AI becomes more embedded in assessment workflows.

That makes RM Compare well suited to a model-agnostic future. Different organisations will use different AI models and different product architectures. What they will all need is a dependable method for checking those systems against trusted human judgement.

That is where RM Compare can add real value.

It can help institutions and partners establish expert benchmarks, compare AI-supported outcomes against those benchmarks, and rerun those comparisons as models, contexts and policies evolve. It can help make AI assessment not just possible, but governable.
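As a rough illustration of what "rerunning those comparisons" could mean in practice, the sketch below assumes that each run against the human benchmark has been reduced to one agreement figure between 0 and 1, however that figure is measured. The `Run` record, the run labels and the `tolerance` value are hypothetical; an institution would set its own thresholds as a matter of policy.

```python
from dataclasses import dataclass


@dataclass
class Run:
    label: str        # hypothetical label, e.g. a model version or a date
    agreement: float  # agreement with the human benchmark, from 0 to 1


def drift_report(runs, baseline_label, tolerance=0.05):
    """List runs whose agreement has fallen below the baseline by more than tolerance.

    Each entry is (label, change in agreement relative to the baseline).
    """
    baseline = next(r for r in runs if r.label == baseline_label)
    return [
        (r.label, round(r.agreement - baseline.agreement, 3))
        for r in runs
        if baseline.agreement - r.agreement > tolerance
    ]


# Made-up history: a pilot followed by two model updates.
history = [
    Run("pilot", 0.82),
    Run("model-v2", 0.80),
    Run("model-v3", 0.71),
]

for label, change in drift_report(history, baseline_label="pilot"):
    print(f"{label}: agreement changed by {change:+.2f}; review before relying on it")
```

A small drop stays within tolerance and passes quietly. A larger one is surfaced as evidence that the system has moved away from the standard it was originally validated against.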

More than accuracy

There is another reason this matters.

Assessment is not only about efficiency or accuracy. It is also about professional confidence, educational values and the dignity of the people involved. If AI becomes a black box that simply emits outcomes, institutions may gain speed but lose something important: a visible connection between expert judgement, standards and the meaning of the results.

A strong assessment system should do more than automate. It should help people understand what quality looks like, preserve confidence in how judgements are made, and support better decisions about learning.

That is why the future of AI in assessment cannot be reduced to a race for the most convincing output. The deeper challenge is building systems where powerful AI remains anchored to human standards and human purposes.

The next phase of AI assessment

The next phase of AI in assessment will not be defined simply by whether the black box works. On many tasks, it may work surprisingly well.

It will be defined by whether institutions can adopt AI without surrendering control of standards, transparency of process, or confidence in the results.

That is why the most important question is no longer, “Can AI assess?”

It is, “What kind of assessment infrastructure do we need when AI starts to get good?”

The answer is not less human judgement. It is better ways of capturing it, applying it, and using it to keep AI aligned with the standards that matter.