"How hard is this task?" - assessing difficulty
Comparative judgement is most commonly used to answer a simple question: Which of these pieces of work is better? Teachers and examiners compare two responses, choose one, and behind the scenes an algorithm turns many such decisions into a reliable rank order and a scale. That idea now underpins everything from trust‑wide writing assessments to high‑stakes awarding.
The same engine can answer a different question: Which of these tasks is harder?
Instead of comparing pieces of student work, we compare the questions, tasks or assignments themselves. That shift opens up a new set of possibilities: building difficulty ladders for curricula, assembling more balanced tests, reducing pre‑testing, and checking that parallel forms are genuinely comparable.
Case study - sheet music
We worked with a music awarding organisation that wanted to introduce some new musical genres and instruments. To do this they needed a bank of sheet music that had been standardised by difficulty.
The existing process was time-consuming, expensive and unreliable. It involved getting a team of experts to find and then 'mark' the relative difficulty of each piece, which inevitably meant countless rounds of moderation before agreement could be reached.
With RM Compare the process was much simpler. The expert judges were simply asked to look at sheet music in pairs and determine which one was more difficult. RM Compare did the hard bit of bringing the judgements together into a reliable rank order.
Outputs and outcomes were dramatically improved. Crucially, it allowed the awarding organisation to respond to market demand much faster, while remaining confident in the robustness of its standards.
What do we mean by task difficulty?
In classical and modern test theory, item difficulty usually means “how many candidates get this question right”, expressed as a facility value or a parameter on an ability scale. That kind of difficulty is powerful, but it relies on large datasets, field trials, and the right modelling expertise.
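For reference, a facility value is simply the proportion of candidates who answer an item correctly. A toy calculation (with invented response data) looks like this:

```python
# Toy example: 1 = correct, 0 = incorrect for ten candidates on one item.
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

# Facility value: the proportion of candidates who got the item right.
facility = sum(responses) / len(responses)
print(f"Facility value: {facility:.2f}")  # 0.70 -> a fairly easy item
```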
Teachers, examiners and hiring managers rarely talk that way. They say things like “this problem is a bit tricky for Year 8” or “this scenario is much trickier than last year’s one”. Humans are naturally better at relative judgements (“this is harder than that”) than absolute ones (“this has a score of 0.63”).
Comparative judgement leans into that strength. Instead of asking judges to predict a number, we ask them to compare two tasks and decide which is the more demanding. From many such comparisons, we can infer a difficulty scale: a ranked list, and – if we choose – numerical difficulty estimates that behave much like traditional item difficulties.
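To make that concrete, here is a minimal sketch of how pairwise "which is harder?" decisions can be turned into a difficulty scale using a Bradley-Terry model. This is an illustration of the general technique, not RM Compare's actual algorithm, and the task names and judgement data are invented.

```python
import math
from collections import defaultdict

# Hypothetical judgement data: each tuple is (harder, easier),
# i.e. the judge decided the first task was more demanding than the second.
judgements = [
    ("algebraic_proof", "fractions_word_problem"),
    ("algebraic_proof", "times_tables"),
    ("fractions_word_problem", "times_tables"),
    ("fractions_word_problem", "number_bonds"),
    ("times_tables", "number_bonds"),
]

tasks = sorted({t for pair in judgements for t in pair})
theta = {t: 0.0 for t in tasks}  # difficulty estimate per task, on a logit scale

# Fit by gradient ascent on the Bradley-Terry log-likelihood:
# P(i judged harder than j) = 1 / (1 + exp(-(theta_i - theta_j)))
learning_rate = 0.1
ridge = 0.05  # small penalty keeps estimates finite when judgements are perfectly consistent

for _ in range(2000):
    grad = defaultdict(float)
    for harder, easier in judgements:
        p = 1.0 / (1.0 + math.exp(-(theta[harder] - theta[easier])))
        grad[harder] += 1.0 - p
        grad[easier] -= 1.0 - p
    for t in tasks:
        theta[t] += learning_rate * (grad[t] - ridge * theta[t])

# Centre the scale at zero so estimates are comparable across runs,
# then print the difficulty ladder from hardest to easiest.
mean_theta = sum(theta.values()) / len(theta)
for difficulty, task in sorted(((v - mean_theta, t) for t, v in theta.items()), reverse=True):
    print(f"{task:25s} {difficulty:+.2f}")
```

The output is exactly the kind of ranked list with numerical estimates described above: tasks ordered from most to least demanding, each with a relative difficulty value.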
Why use comparative judgement for difficulty?
There are three main reasons this approach is attractive.
- It works when data are thin. You can estimate relative difficulty before you have thousands of test takers, which is especially useful for new curricula, new item banks, or rapidly‑changing domains.
- It captures human expertise. Subject specialists can see nuances in tasks (content, language load, cognitive demand) that aren’t obvious from historical facility statistics alone, particularly for complex performance tasks.
- It produces actionable outputs. A clear difficulty ladder helps you design progressions, assemble balanced papers, and choose which tasks to trial or retire.
The goal is not to replace statistical analysis, but to provide a structured, scalable way of turning expert intuition into data that can feed curriculum design, item banking and standards maintenance.
Use in any context
| Sector | Primary “difficulty” job to be done |
|---|---|
| Schools & school groups | Align curriculum expectations and build sensible progressions of tasks |
| High‑volume recruitment | Calibrate work‑sample and scenario tasks across cohorts and campaigns |
| Global awarding | Support item banking and standards‑maintenance alongside equating |
| Higher education | Design tiered admissions and coursework tasks that discriminate fairly |
| Training & compliance | Ensure cross‑site assessments are comparable and auditable in difficulty |
Because the stakes vary, so does the tolerance for approximation. A multi‑academy trust might happily use a difficulty ladder as a guide for curriculum conversations, while an exam board will want formal evidence that CJ‑based estimates behave sensibly alongside statistical equating. Future posts will make those differences explicit rather than pretending one accuracy standard fits all.