What the Recent Apple Study Taught Us About AI, Reasoning, and Assessment

After attending the excellent e-Assessment Association conference in London this week, I have a little time to reflect, not least on matters of AI in assessment.

Earlier this month Apple published a landmark study that has sent shockwaves through the AI community. The research, titled The Illusion of Thinking, rigorously tested the reasoning abilities of the most advanced AI models—so-called large reasoning models (LRMs) from OpenAI, Google, Anthropic, and others—using a series of classic logic puzzles designed to scale in complexity. The findings have profound implications for how we understand AI’s capabilities, especially in the context of educational assessment.

What Did the Apple Study Do?

Apple’s researchers moved beyond standard benchmarks like maths and coding problems, which often suffer from data contamination and don’t truly test reasoning. Instead, they created controlled puzzle environments—such as the Tower of Hanoi, river crossing, and block stacking—where they could precisely adjust the complexity and observe how both standard large language models (LLMs) and LRMs performed.
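
To make "precisely adjust the complexity" concrete: in the Tower of Hanoi, a single parameter, the number of disks, controls how hard the puzzle is, and the minimum solution length grows exponentially with it. The short sketch below is my own illustration of that scaling, not code from the paper.

    # Illustrative only: how one parameter dials up Tower of Hanoi difficulty.
    # The minimum number of moves for n disks is the classic closed form 2^n - 1,
    # so each extra disk roughly doubles the length of a correct solution.
    def min_moves(num_disks: int) -> int:
        return 2 ** num_disks - 1

    for n in (3, 5, 10, 15, 20):
        print(f"{n:2d} disks -> {min_moves(n):,} minimum moves")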

The results were striking:

  • On simple tasks, standard LLMs were more accurate and more efficient than the LRMs.
  • On moderately complex tasks, LRMs, with their extended chain-of-thought style reasoning, performed better than standard LLMs.
  • On truly complex tasks, both types of models failed completely—their accuracy dropped to zero, regardless of model size or computational resources.

Even when explicitly provided with the correct algorithm, the models could not reliably execute step-by-step instructions on complex problems. The research revealed that what appears to be “reasoning” is often just sophisticated pattern matching, not genuine understanding or logical computation.
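
For context, the "correct algorithm" for a puzzle like the Tower of Hanoi is short and entirely mechanical. The sketch below is my own rendering of the classic recursive solution, not the prompt used in the study; a conventional program executes it flawlessly at any depth, which is what makes the models' inability to follow it on larger instances so telling.

    # Classic recursive Tower of Hanoi solution (illustrative; not the study's prompt).
    def hanoi(n, source, target, spare, moves):
        if n == 0:
            return
        hanoi(n - 1, source, spare, target, moves)  # move the top n-1 disks out of the way
        moves.append((source, target))              # move the largest remaining disk
        hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top of it

    moves = []
    hanoi(3, "A", "C", "B", moves)
    print(moves)  # seven moves for three disks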

When billion-dollar AIs break down over puzzles a child can do, it’s time to rethink the hype.

The Difference Between Complicated and Complex Assessment Items

This distinction is vital for anyone designing or relying on AI for assessment:

  • Complicated items are challenging but can be broken down into clear, well-defined steps. AI models, especially LRMs, can often handle these by leveraging patterns from their training data.
  • Complex items require flexible, context-sensitive reasoning and often involve ambiguity, nuance, or novel situations. Here, the models’ performance collapses—they cannot generalise beyond what they have seen before, nor can they reliably follow explicit instructions for new types of problems.

For assessment, this means that while AI can be competent at grading or ranking straightforward, well-structured tasks, it struggles with the kind of open-ended, context-rich work that is often most valuable in education and professional settings.

This matters in assessment. A lot.

How RM Compare Combines AI and Human Judgment to Meet the Challenge

At RM Compare, we have long recognised both the strengths and the limits of AI in assessment. Our approach is to harness the best of both worlds: using AI to optimise and scale the assessment process, while keeping human expertise at the heart of evaluation.

  • AI for Workflow and Adaptivity: RM Compare uses machine learning to drive the adaptivity and efficiency of the comparative judgement process. The AI ensures that each piece of work is compared in the most informative way, reducing workload and speeding up large-scale assessments (a simplified sketch of how pairwise judgements become a scale appears after this list).
  • Humans for Complex Judgement: Crucially, RM Compare does not replace human judges. Instead, it empowers them—especially when dealing with complex, nuanced, or creative work where human reasoning, empathy, and contextual understanding are irreplaceable.
  • Chaining and Cognitive Load: Features like chaining help reduce cognitive load for assessors when dealing with complex items, allowing them to build familiarity and make more consistent, informed judgements.
  • Assessing AI-Generated Content: RM Compare can also facilitate the assessment of AI-generated work, supporting innovative approaches like “Learning by Evaluating,” where students judge and learn from both human and AI-created content.
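
To illustrate the division of labour described above, here is a deliberately simplified, hypothetical sketch of the statistical layer that comparative judgement approaches in general rely on: a basic Bradley-Terry style fit that turns many human "A beats B" decisions into a measurement scale. It is not RM Compare's actual implementation, which is not documented here, and the adaptive layer that chooses which pair to present next is omitted.

    # Hypothetical sketch: a basic Bradley-Terry fit over pairwise human judgements.
    # Illustrates comparative judgement in general, not RM Compare's own algorithm.
    from collections import defaultdict

    def bradley_terry(judgements, items, iterations=100):
        """judgements: list of (winner, loser) pairs made by human judges."""
        wins = defaultdict(int)         # total wins per item
        pair_counts = defaultdict(int)  # number of comparisons per unordered pair
        for winner, loser in judgements:
            wins[winner] += 1
            pair_counts[frozenset((winner, loser))] += 1

        strength = {item: 1.0 for item in items}
        for _ in range(iterations):
            updated = {}
            for i in items:
                denom = sum(
                    pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                    for j in items if j != i
                )
                updated[i] = wins[i] / denom if denom else strength[i]
            total = sum(updated.values())
            strength = {i: s / total for i, s in updated.items()}
        return strength  # higher value = judged better overall

    # Example: four pieces of work, each judged against the others.
    judgements = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("C", "D"), ("D", "B")]
    print(bradley_terry(judgements, ["A", "B", "C", "D"]))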

The Takeaway

Apple’s study is a timely reminder: AI is a powerful tool, but it is not a substitute for human reasoning—especially in the realm of complex assessment. At RM Compare, we believe the future of assessment lies in hybrid systems that combine the efficiency and scalability of AI with the depth and nuance of human judgement. This approach not only addresses the limitations highlighted by Apple’s research but also ensures that assessments remain fair, meaningful, and fit for the challenges of a rapidly changing world.

I think we will probably resist the name change!!

References