- Opinion
Lessons from Software Engineering: Why AI in Assessment May Be Solving the Wrong Problem
Every few years, a body of evidence emerges from an unexpected direction that turns out to be exactly what education needed to hear. We think this might be one of those moments.
The 2025 DORA State of AI-Assisted Software Development report, a large-scale study drawing on survey responses from nearly 5,000 technology professionals worldwide and more than 100 hours of qualitative research, was written for software engineering leaders. But its central finding has implications that reach well beyond the world of code.
It is worth reading carefully, because we believe it maps with unusual precision onto the debate now taking shape in education assessment.
What the software engineers found
The report's headline finding is disarmingly simple: AI is an amplifier.
"It magnifies the strengths of high-performing organisations and the dysfunctions of struggling ones."
Teams with strong practices, clear workflows, and shared expertise saw genuine benefits from AI adoption. Teams without those foundations found that AI made their existing problems more visible — and more damaging — than before. The report puts it in terms worth sitting with: AI creates "localised pockets of productivity that are often lost to downstream chaos."
This is not a story about AI being good or bad. It is a story about AI being a mirror reflecting back the underlying capabilities of the system it is placed into, and amplifying whatever it finds.
There is a second, subtler finding. In 2024, when AI adoption first became widespread in software teams, it increased individual productivity while simultaneously reducing delivery stability. Developers were moving faster. The systems around them were not ready to carry that speed. The local gain was real; the systemic benefit was not. Speed went up. Reliability went down.
The authors describe this as a second-order effect, an outcome that only becomes visible once the initial glow of productivity has faded, and once you start looking at what the surrounding system is doing rather than what the individual is doing.
Why this matters for assessment
We raise this not to make a point about software, but because we think the pattern is directly relevant to what is happening right now with AI in education assessment.
The efficiency case for AI in assessment is real and, on its own terms, compelling. Faster feedback. Reduced teacher workload. Consistent application of criteria at scale. These are genuine benefits, and we do not dismiss them.
But the question the DORA findings prompt us to ask is a different one: what is the surrounding system doing while we capture those local gains?
In assessment, the surrounding system includes several things that are easy to overlook when the conversation is dominated by speed and cost.
It includes teacher expertise - the tacit, accumulated knowledge of what quality looks like across a range of student work, built through repeated acts of careful reading and judgement. It includes professional calibration - the process by which a community of assessors develops shared standards, catches its own inconsistencies, and improves over time. And it includes the learning signal - the feedback loop between the act of assessing and the improvement of teaching, which depends on teachers being genuinely close to student work rather than supervising outputs they did not produce.
These are not soft or sentimental concerns. They are the foundations on which the validity of assessment rests. And they are precisely the kinds of foundations that the DORA report warns are vulnerable to being quietly eroded by AI adoption that focuses on local productivity without attending to the system as a whole.
Research from the National Centre for Improvement in Educational Assessment synthesises the emerging evidence clearly: it remains genuinely unclear whether AI-generated feedback aligns with the principles that make feedback effective - the right level of detail, at the right time, for the right person (NCIEA, 2025). A large-scale Australian study across four universities found that students rated teacher feedback as significantly more helpful and more trustworthy than AI feedback, even when they used both (Taylor & Francis, 2025).
Meanwhile, research published in peer-reviewed medical and educational literature now explicitly names both deskilling and never-skilling as risks of early AI adoption - the latter being the failure to develop essential capabilities in the first place, when AI is introduced before those capabilities are established (PMC, 2026). The RSIS International review is particularly pointed: AI grading tools "limit holistic evaluation negatively impacting the teacher's nuanced judgment of student intent, effort, and learning trajectory. Overdependence may desensitise instructors from identifying subtle learning difficulties, creative approaches, or ethical dilemmas in student work" (RSIS International, 2025).
The Comparative Judgement question
Comparative Judgement is a methodology with a strong and well-established evidence base. Its reliability comes from aggregating a large number of human decisions - each one a small act of expert judgement - into a statistically robust measurement scale. The act of comparison is not merely a mechanism for producing a rank order. It is a process through which assessors develop and calibrate their understanding of quality. The validity of the output depends on the quality of the human judgement that goes into it.
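The aggregation step described above can be made concrete. Comparative Judgement systems typically fit a Bradley-Terry-style model to the pairwise outcomes, recovering a quality scale in which only the differences between scripts matter. Below is a minimal sketch with made-up comparison data and a deliberately simple gradient-ascent fit; real implementations use far more comparisons and more robust estimation, and the script names here are purely illustrative:

```python
import math

# Hypothetical pairwise outcomes as (winner, loser) script IDs.
# In a real Comparative Judgement session these come from expert judges.
comparisons = [
    ("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
    ("B", "C"), ("B", "C"), ("C", "B"),
]

scripts = sorted({s for pair in comparisons for s in pair})
theta = {s: 0.0 for s in scripts}  # quality estimates on a logit scale

# Gradient ascent on the Bradley-Terry log-likelihood, where
# P(i beats j) = 1 / (1 + exp(-(theta_i - theta_j))).
for _ in range(500):
    grad = {s: 0.0 for s in scripts}
    for winner, loser in comparisons:
        p_win = 1.0 / (1.0 + math.exp(-(theta[winner] - theta[loser])))
        grad[winner] += 1.0 - p_win
        grad[loser] -= 1.0 - p_win
    for s in scripts:
        theta[s] += 0.1 * grad[s]
    # Anchor the scale at mean zero: only differences are identified.
    mean = sum(theta.values()) / len(theta)
    for s in scripts:
        theta[s] -= mean

ranked = sorted(scripts, key=theta.get, reverse=True)
print(ranked)  # A won every comparison, so it ranks first
```

The point, for the argument that follows, is that the model only aggregates decisions; it does not originate them. The scale is exactly as trustworthy as the judgements fed into it.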
There is now a push, visible in the market, to replace the majority of those human judgements with AI. Some providers recommend a configuration in which AI handles 90% of comparisons, with human judges covering the remaining 10%. The efficiency argument is straightforward: teacher judging time is reduced by approximately 90%.
We think it is worth asking what happens to the 90% of professional calibration that is no longer taking place.
The DORA report makes a point that applies here with some precision: "Successful AI adoption is a systems problem, not a tools problem. The value of AI is unlocked not by the tools themselves, but by the surrounding technical and cultural environment." In assessment terms: the question is not whether AI can make a reliable comparative judgement in isolation. It is what happens to the system - the teachers, the standards, the expertise - when AI makes most of the judgements most of the time.
This is not necessarily a criticism of AI's capability to judge (although there are open questions there). It is a question about what the practice of judging does for the people doing it, and what the system loses when that practice is largely automated away.
What thoughtful AI adoption looks like
The DORA report does not argue against AI. It argues for AI adoption that is treated as a systems transformation rather than a tool deployment. Its practical conclusion is that AI should be directed at the work requiring the least professional judgement, in order to free human capacity for the work requiring the most.
In assessment, that principle points somewhere quite specific. AI can helpfully automate the administrative burden of assessment: transcription, organisation, pattern analysis across large datasets, and structuring feedback comments that teachers have already provided in their own words. These are meaningful time savings that do not compromise the professional judgement at the heart of the process.
What it suggests more caution about is using AI to replace the acts of judgement themselves. These are the comparisons, the evaluations, the decisions about quality that are both the output and the engine of high-quality assessment practice.
The Chartered College of Teaching frames it well: "matching AI use to assessment stakes matters. Maintaining the teacher's central role, especially in assessment design and quality assurance, is essential" (Chartered College, 2025).
Our position
At RM Compare, we have followed this evidence carefully and with genuine interest. We are not sceptical of AI as a technology. We are cautious about the pace at which consequential decisions about assessment methodology are being made before the second-order effects are well understood.
The software engineering community has had several years of live data on what AI adoption does to the systems around it. Education is only beginning that journey. It seems worth drawing on what has already been learned.
RM Compare is built on the principle that the expertise of teachers and subject specialists is not an inefficiency to be designed around - it is the source of the validity we are trying to scale. Our approach to AI reflects that: we are focused on where technology can extend and support professional judgement, not on where it can replace the need to exercise it.
The DORA report ends with a line we find worth quoting directly: "For organisations ready to look, the reflection AI offers becomes a roadmap."
In assessment, we think the reflection is beginning. The roadmap is still being drawn.