What the Recent Apple Study Taught Us About AI, Reasoning, and Assessment

After attending the excellent e-Assessment Association conference in London this week, I have a little time to reflect, not least on matters of AI in assessment.

Earlier this month Apple published a landmark study that has sent shockwaves through the AI community. The research, titled The Illusion of Thinking, rigorously tested the reasoning abilities of the most advanced AI models—so-called large reasoning models (LRMs) from OpenAI, Google, Anthropic, and others—using a series of classic logic puzzles designed to scale in complexity. The findings have profound implications for how we understand AI’s capabilities, especially in the context of educational assessment.

What Did the Apple Study Do?

Apple’s researchers moved beyond standard benchmarks like maths and coding problems, which often suffer from data contamination and don’t truly test reasoning. Instead, they created controlled puzzle environments—such as the Tower of Hanoi, river crossing, and block stacking—where they could precisely adjust the complexity and observe how both standard large language models (LLMs) and LRMs performed.
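
To make "precisely adjust the complexity" concrete: in the Tower of Hanoi, a single parameter, the number of disks, controls how hard the puzzle is, and the minimum solution length grows exponentially with it. The short sketch below is my own illustration of that scaling, not code from the paper.

    # Illustrative only: how one parameter dials up Tower of Hanoi difficulty.
    # The minimum number of moves for n disks is the classic closed form 2^n - 1,
    # so each extra disk roughly doubles the length of a correct solution.
    def min_moves(num_disks: int) -> int:
        return 2 ** num_disks - 1

    for n in (3, 5, 10, 15, 20):
        print(f"{n:2d} disks -> {min_moves(n):,} minimum moves")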

The results were striking:

  • On simple tasks, standard LLMs were more accurate and more efficient than the LRMs.
  • On moderately complex tasks, LRMs, with their extended chain-of-thought style reasoning, performed better than standard LLMs.
  • On truly complex tasks, both types of models failed completely—their accuracy dropped to zero, regardless of model size or computational resources.

Even when explicitly provided with the correct algorithm, the models could not reliably execute step-by-step instructions on complex problems. The research revealed that what appears to be “reasoning” is often just sophisticated pattern matching, not genuine understanding or logical computation.
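
For context, the "correct algorithm" for a puzzle like the Tower of Hanoi is short and entirely mechanical. The sketch below is my own rendering of the classic recursive solution, not the prompt used in the study; a conventional program executes it flawlessly at any depth, which is what makes the models' inability to follow it on larger instances so telling.

    # Classic recursive Tower of Hanoi solution (illustrative; not the study's prompt).
    def hanoi(n, source, target, spare, moves):
        if n == 0:
            return
        hanoi(n - 1, source, spare, target, moves)  # move the top n-1 disks out of the way
        moves.append((source, target))              # move the largest remaining disk
        hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top of it

    moves = []
    hanoi(3, "A", "C", "B", moves)
    print(moves)  # seven moves for three disks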

When billion-dollar AIs break down over puzzles a child can do, it’s time to rethink the hype.

The Difference Between Complicated and Complex Assessment Items

This distinction is vital for anyone designing or relying on AI for assessment:

  • Complicated items are challenging but can be broken down into clear, well-defined steps. AI models, especially LRMs, can often handle these by leveraging patterns from their training data.
  • Complex items require flexible, context-sensitive reasoning and often involve ambiguity, nuance, or novel situations. Here, the models’ performance collapses—they cannot generalise beyond what they have seen before, nor can they reliably follow explicit instructions for new types of problems.

For assessment, this means that while AI can be competent at grading or ranking straightforward, well-structured tasks, it struggles with the kind of open-ended, context-rich work that is often most valuable in education and professional settings.

This matters in assessment. A lot.

How RM Compare Combines AI and Human Judgment to Meet the Challenge

At RM Compare, we have long recognised both the strengths and the limits of AI in assessment. Our approach is to harness the best of both worlds: using AI to optimise and scale the assessment process, while keeping human expertise at the heart of evaluation.

  • AI for Workflow and Adaptivity: RM Compare uses machine learning to drive the adaptivity and efficiency of the comparative judgement process. The AI ensures that each piece of work is compared in the most informative way, reducing workload and speeding up large-scale assessments (a simplified sketch of how pairwise judgements become a scale appears after this list).
  • Humans for Complex Judgement: Crucially, RM Compare does not replace human judges. Instead, it empowers them—especially when dealing with complex, nuanced, or creative work where human reasoning, empathy, and contextual understanding are irreplaceable.
  • Chaining and Cognitive Load: Features like chaining help reduce cognitive load for assessors when dealing with complex items, allowing them to build familiarity and make more consistent, informed judgements.
  • Assessing AI-Generated Content: RM Compare can also facilitate the assessment of AI-generated work, supporting innovative approaches like “Learning by Evaluating,” where students judge and learn from both human and AI-created content.
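
To illustrate the division of labour described above, here is a deliberately simplified, hypothetical sketch of the statistical layer that comparative judgement approaches in general rely on: a basic Bradley-Terry style fit that turns many human "A beats B" decisions into a measurement scale. It is not RM Compare's actual implementation, which is not documented here, and the adaptive layer that chooses which pair to present next is omitted.

    # Hypothetical sketch: a basic Bradley-Terry fit over pairwise human judgements.
    # Illustrates comparative judgement in general, not RM Compare's own algorithm.
    from collections import defaultdict

    def bradley_terry(judgements, items, iterations=100):
        """judgements: list of (winner, loser) pairs made by human judges."""
        wins = defaultdict(int)         # total wins per item
        pair_counts = defaultdict(int)  # number of comparisons per unordered pair
        for winner, loser in judgements:
            wins[winner] += 1
            pair_counts[frozenset((winner, loser))] += 1

        strength = {item: 1.0 for item in items}
        for _ in range(iterations):
            updated = {}
            for i in items:
                denom = sum(
                    pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                    for j in items if j != i
                )
                updated[i] = wins[i] / denom if denom else strength[i]
            total = sum(updated.values())
            strength = {i: s / total for i, s in updated.items()}
        return strength  # higher value = judged better overall

    # Example: four pieces of work, each judged against the others.
    judgements = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("C", "D"), ("D", "B")]
    print(bradley_terry(judgements, ["A", "B", "C", "D"]))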

The Takeaway

Apple’s study is a timely reminder: AI is a powerful tool, but it is not a substitute for human reasoning—especially in the realm of complex assessment. At RM Compare, we believe the future of assessment lies in hybrid systems that combine the efficiency and scalability of AI with the depth and nuance of human judgement. This approach not only addresses the limitations highlighted by Apple’s research but also ensures that assessments remain fair, meaningful, and fit for the challenges of a rapidly changing world.

I think we will probably resist the name change!!

References