"How hard is this task?" - assessing difficulty
Comparative judgement is most commonly used to answer a simple question: Which of these pieces of work is better? Teachers and examiners compare two responses, choose one, and behind the scenes an algorithm turns many such decisions into a reliable rank order and a scale. That idea now underpins everything from trust‑wide writing assessments to high‑stakes awarding.
The same engine can answer a different question: Which of these tasks is harder?
Instead of comparing pieces of student work, we compare the questions, tasks or assignments themselves. That shift opens up a new set of possibilities: building difficulty ladders for curricula, assembling more balanced tests, reducing pre‑testing, and checking that parallel forms are genuinely comparable.
Case study - sheet music
We worked with a music awarding organisation that wanted to introduce some new musical genres and instruments. To do this they needed a bank of sheet music that had been standardised by difficulty.
The existing process was time-consuming, expensive and unreliable. It involved getting a team of experts to find and then 'mark' the relative difficulty of each piece, which inevitably meant countless rounds of moderation before agreement could be reached.
With RM Compare the process was much simpler. The expert judges were simply asked to look at sheet music in pairs and determine which one was more difficult. RM Compare did the hard bit of bringing the judgements together into a reliable rank order.
Outputs and outcomes were dramatically improved. Crucially, it allowed the awarding organisation to respond to market demand much faster, while remaining confident in the robustness of its standards.
What do we mean by task difficulty?
In classical and modern test theory, item difficulty usually means “how many candidates get this question right”, expressed as a facility value or a parameter on an ability scale. That kind of difficulty is powerful, but it relies on large datasets, field trials, and the right modelling expertise.
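For reference, a facility value is simply the proportion of candidates who answer an item correctly. A toy calculation (with invented response data) looks like this:

```python
# Toy example: 1 = correct, 0 = incorrect for ten candidates on one item.
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

# Facility value: the proportion of candidates who got the item right.
facility = sum(responses) / len(responses)
print(f"Facility value: {facility:.2f}")  # 0.70 -> a fairly easy item
```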
Teachers, examiners and hiring managers rarely talk that way. They say things like “this problem is a bit tricky for Year 8” or “this scenario is much trickier than last year’s one”. Humans are naturally better at relative judgements (“this is harder than that”) than absolute ones (“this has a score of 0.63”).
Comparative judgement leans into that strength. Instead of asking judges to predict a number, we ask them to compare two tasks and decide which is the more demanding. From many such comparisons, we can infer a difficulty scale: a ranked list, and – if we choose – numerical difficulty estimates that behave much like traditional item difficulties.
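To make that concrete, here is a minimal sketch of how pairwise "which is harder?" decisions can be turned into a difficulty scale using a Bradley-Terry model. This is an illustration of the general technique, not RM Compare's actual algorithm, and the task names and judgement data are invented.

```python
import math
from collections import defaultdict

# Hypothetical judgement data: each tuple is (harder, easier),
# i.e. the judge decided the first task was more demanding than the second.
judgements = [
    ("algebraic_proof", "fractions_word_problem"),
    ("algebraic_proof", "times_tables"),
    ("fractions_word_problem", "times_tables"),
    ("fractions_word_problem", "number_bonds"),
    ("times_tables", "number_bonds"),
]

tasks = sorted({t for pair in judgements for t in pair})
theta = {t: 0.0 for t in tasks}  # difficulty estimate per task, on a logit scale

# Fit by gradient ascent on the Bradley-Terry log-likelihood:
# P(i judged harder than j) = 1 / (1 + exp(-(theta_i - theta_j)))
learning_rate = 0.1
ridge = 0.05  # small penalty keeps estimates finite when judgements are perfectly consistent

for _ in range(2000):
    grad = defaultdict(float)
    for harder, easier in judgements:
        p = 1.0 / (1.0 + math.exp(-(theta[harder] - theta[easier])))
        grad[harder] += 1.0 - p
        grad[easier] -= 1.0 - p
    for t in tasks:
        theta[t] += learning_rate * (grad[t] - ridge * theta[t])

# Centre the scale at zero so estimates are comparable across runs,
# then print the difficulty ladder from hardest to easiest.
mean_theta = sum(theta.values()) / len(theta)
for difficulty, task in sorted(((v - mean_theta, t) for t, v in theta.items()), reverse=True):
    print(f"{task:25s} {difficulty:+.2f}")
```

The output is exactly the kind of ranked list with numerical estimates described above: tasks ordered from most to least demanding, each with a relative difficulty value.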
Why use comparative judgement for difficulty?
There are three main reasons this approach is attractive.
- It works when data are thin. You can estimate relative difficulty before you have thousands of test takers, which is especially useful for new curricula, new item banks, or rapidly‑changing domains.
- It captures human expertise. Subject specialists can see nuances in tasks (content, language load, cognitive demand) that aren’t obvious from historical facility statistics alone, particularly for complex performance tasks.
- It produces actionable outputs. A clear difficulty ladder helps you design progressions, assemble balanced papers, and choose which tasks to trial or retire.
The goal is not to replace statistical analysis, but to provide a structured, scalable way of turning expert intuition into data that can feed curriculum design, item banking and standards maintenance.
Use in any context
| Sector | Primary “difficulty” job to be done |
|---|---|
| Schools & school groups | Align curriculum expectations and build sensible progressions of tasks |
| High‑volume recruitment | Calibrate work‑sample and scenario tasks across cohorts and campaigns |
| Global awarding | Support item banking and standards‑maintenance alongside equating |
| Higher education | Design tiered admissions and coursework tasks that discriminate fairly |
| Training & compliance | Ensure cross‑site assessments are comparable and auditable in difficulty |
Because the stakes vary, so does the tolerance for approximation. A multi‑academy trust might happily use a difficulty ladder as a guide for curriculum conversations, while an exam board will want formal evidence that CJ‑based estimates behave sensibly alongside statistical equating. Future posts will make those differences explicit rather than pretending one accuracy standard fits all.