From pilots to products: how organisations can modernise without blowing everything up
Every exam body I talk to feels the same squeeze. Governments want innovation. Schools want recognition for richer work. Generative AI is crashing into the system from all sides. And yet, when results day comes, the only thing that really matters is whether the grades stand up in the media and in court.
With RM Compare, we’ve learned that there is a slower, safer and ultimately more powerful path: build a human‑centred judgement infrastructure first, then let AI work around it. The way to get there isn’t through one huge transformation programme. It’s through a series of deliberately small, well‑designed steps: Pilot → Prototype → Product.
Step 1: Pilot – learn fast, touch lightly
A pilot is where we both get to ask, “Is this even a good idea for your context?”—without anyone pretending it’s already business as usual.
That might mean taking a single component—say, a coursework project or a practical—and running a comparative judgement session alongside the existing marking approach. We’re not trying to re‑engineer your entire system. We’re trying to learn.
In a good pilot, questions are more important than features. Do judges reach more stable decisions when they compare work instead of scoring it? Do standards become easier to articulate and share? Does this give the exam body a story they could explain to teachers, unions and the public?
Technically, the rule is simple: use the platform as it is. We rely on the existing 💻|Studio, 🔗|Hub and 📳|Live tools, and the APIs we’ve already committed to. If we hit a limitation, we don’t hack around it for just one customer; we capture it as product feedback. The pilot earns the right to influence the roadmap, not to fork it.
Step 2: Prototype – prove a pattern, not a one‑off
If a pilot shows promise, the next question is: can this become a repeatable part of how you work?
That’s where the prototype comes in. Here we’re not just asking “does adaptive comparative judgement (ACJ) work on this component?” but “is there a pattern we can use again and again across your portfolio?”
The prototype might take the same comparative judgement workflow and apply it to several components across different subjects. Or it might explore how standards created in RM Compare flow into existing on‑screen marking systems, so that examiner training and monitoring are anchored in real student work. Or it might be a full exam‑series AI validation study, where we check an automated marker against a human judgement standard.
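To make the AI validation idea more concrete, here is a minimal sketch of one check such a study might include: comparing the rank order produced by an automated marker with the scale produced by human judges. The scores, the threshold and the variable names are invented for illustration and are not part of the RM Compare API; it is just plain Python with SciPy.

```python
# Hypothetical sketch: check an automated marker against a human
# comparative-judgement standard for one component.
# All values below are invented for illustration only.
from scipy.stats import spearmanr

# Scale values produced by human judges in a comparative judgement session
# (e.g. fitted from their pairwise decisions), keyed by script ID.
human_scale = {"s01": -1.8, "s02": -0.6, "s03": 0.1, "s04": 0.9, "s05": 2.1}

# Marks awarded to the same scripts by the automated marker under test.
ai_marks = {"s01": 12, "s02": 18, "s03": 17, "s04": 23, "s05": 28}

scripts = sorted(human_scale)
rho, p_value = spearmanr(
    [human_scale[s] for s in scripts],
    [ai_marks[s] for s in scripts],
)

# An illustrative acceptance threshold; a real study would agree this
# with the exam body and look at far more than a single coefficient.
print(f"rank agreement rho={rho:.2f} (p={p_value:.3f})")
if rho < 0.8:
    print("Automated marker diverges from the human standard; investigate.")
```

In a real exam‑series study, agreement would be examined script by script and alongside qualitative review, not reduced to one number; the sketch only shows the shape of the comparison.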
The important thing is that what we do at this stage is describable. We should be able to write it down as a configuration template, a short set of API calls, a playbook that a local team can follow. If the only way to repeat it is to fly the original project team back in and reinvent everything, then we haven’t built a prototype. We’ve just done a bigger pilot.
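As an illustration of what “describable” can mean in practice, here is a sketch of a configuration template expressed as plain Python. Every field name is hypothetical rather than RM Compare’s actual schema; the point is simply that a local team could read, adapt and re‑run the pattern without the original project team in the room.

```python
# Hypothetical configuration template for a repeatable comparative
# judgement session. Field names are illustrative, not RM Compare's schema.
SESSION_TEMPLATE = {
    "component": "Coursework project (example component)",
    "judging_mode": "pairwise",          # judges compare two pieces of work
    "judges_per_session": 12,
    "comparisons_per_script": 10,        # enough decisions to stabilise the scale
    "anchor_scripts": ["exemplar_low", "exemplar_mid", "exemplar_high"],
    "outputs": ["scaled_scores", "judge_consistency_report"],
}

def new_session(component_name: str) -> dict:
    """Clone the template for another component, so the pattern repeats."""
    session = dict(SESSION_TEMPLATE)
    session["component"] = component_name
    return session

# A local team can reuse the same pattern across subjects:
maths_session = new_session("Mathematics problem-solving task")
```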
For the exam body, this phase is about confidence: confidence that they can run this themselves, that it fits their operational and political reality, and that it doesn’t require burning down what already works.
Step 3: Product – treat judgement as infrastructure
Only once a pattern has survived contact with real life do we talk about products in the full sense: contracts, capacity, multi‑year commitments.
At this point, RM Compare stops being “something we used in a project once” and becomes part of the assessment infrastructure. It sits alongside item banking, on‑screen marking and other functions as a standard tool: the place where human judgement is captured, aggregated and governed at scale.
Practically, that might look like an enterprise licence rather than a tangle of one‑off fees. It might look like annual judgement‑capacity bands instead of per‑component deals, so the organisation is free to move components in and out as needs change. It might include a standing option to run AI validation studies whenever they roll out or update automated marking.
More important than the commercial detail is the shift in ownership. The exam body’s own people, and their local partners, own the “final mile”: choosing where to use comparative judgement, running sessions, interpreting outputs. Our job becomes keeping the platform coherent, predictable and well‑documented, so it can support many different stories without having to be rewritten for each one.
Why this path suits high‑stakes assessment
It aligns with how exam bodies already manage risk. Research, then pilot, then phased implementation is baked into their culture. A Pilot → Prototype → Product path simply gives that rhythm clearer stages and names in the context of comparative judgement and AI.
It keeps humans at the centre. Comparative judgement uses human experts to define and maintain standards. Those standards can then be used to train examiners, to audit consistency and, if desired, to test whether AI systems are behaving as intended. The infrastructure exists to support human judgement, not to bypass it.
And it preserves optionality. At any point, the organisation can decide that the world isn’t ready, or that other priorities have overtaken this one, and step off the path without having bet the system on a single all‑or‑nothing change.
For us, it also guards the integrity of RM Compare as a product. Instead of vanishing into a few custom national implementations, the platform stays global, with a roadmap that serves all of its users.
Growing together, not bolting on
We’re reflecting this approach in how we talk about RM Compare publicly. Rather than presenting a grab‑bag of services, we’re starting to describe a relationship that can grow in clear stages: a pilot to explore the fit, a prototype to prove a pattern, then a product that becomes part of the exam body’s core infrastructure.
If that’s you, the invitation is straightforward: start small, with one component and a clear question. Use the pilot to learn. Use the prototype to prove a pattern. Then, when you are ready, turn judgement into the infrastructure that everything else can lean on.