Can AI Be Trusted to Handle Complex Work? New Benchmark Reveals Alarming Document Degradation
Recent research from Microsoft has raised serious questions about the reliability of large language models (LLMs) when tasked with complex, multi-step editing workflows. The study, described in a preprint paper titled "LLMs Corrupt Your Documents When You Delegate," introduces a benchmark called DELEGATE-52 that simulates real-world knowledge worker tasks across 52 professional domains. The results show that even the most advanced LLMs introduce substantial errors over repeated interactions, leading to significant document degradation. This Q&A explores the findings, expert reactions, and what they mean for enterprise adoption of generative AI.
What exactly did the DELEGATE-52 benchmark test?
The benchmark, created by Microsoft researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville, simulated workflows a knowledge worker might perform. It included 310 work environments across 52 domains, such as coding, crystallography, genealogy, and music sheet notation. Each environment contained real documents averaging about 15,000 tokens in length. The test required the LLMs to perform between five and ten complex editing tasks per environment, mimicking user delegation. The goal was to measure how well the models could preserve document integrity over multiple edits without human oversight. The benchmark was designed to be rigorous: tasks were multi-step, context-dependent, and required careful reasoning to avoid introducing errors.
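To make that setup concrete, here is a minimal sketch of what a delegated-editing harness of this shape could look like. Every name in it (Environment, run_delegation, the stub edit function) is an illustrative assumption; the paper's actual harness is not reproduced in this article.

```python
from dataclasses import dataclass

@dataclass
class Environment:
    domain: str        # one of 52 domains, e.g. "coding" or "crystallography"
    document: str      # a real document, averaging ~15,000 tokens in the study
    tasks: list[str]   # five to ten delegated editing tasks

def run_delegation(env: Environment, edit) -> list[str]:
    """Apply each task in sequence, as a delegating user would, keeping
    every revision so degradation can be measured afterwards."""
    revisions = [env.document]
    for task in env.tasks:
        # The model always edits the latest revision, so any error it
        # introduces is carried forward into every subsequent edit.
        revisions.append(edit(revisions[-1], task))
    return revisions

# Identity stub standing in for a real LLM call.
env = Environment(domain="genealogy",
                  document="Born 1891 in Turin; married 1915.",
                  tasks=["normalize dates", "add source citations"])
history = run_delegation(env, lambda doc, task: doc)
print(len(history) - 1, "edits applied")
```

The key structural point is the chaining: each edit operates on the previous model output rather than the original document, which is why errors compound over long interactions.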

What were the key findings about LLM reliability?
The study's abstract states a stark conclusion: "Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction." In practice, this means that when an LLM makes an edit, it might occasionally introduce a mistake that goes unnoticed, and over a series of edits, these errors accumulate. The researchers found that models like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 (referred to as frontier models) lost an average of 25% of document content after 20 delegated interactions. Across all 19 models tested, the average degradation was a staggering 50%. This suggests that without robust guardrails, LLMs cannot be trusted to autonomously maintain the integrity of complex documents.
How much document degradation did frontier models cause?
According to the paper, frontier models saw document content drop by an average of about 25% over 20 interactions. This isn't a simple loss of text; the models introduced errors that corrupted data, changed meanings, or deleted important sections. Across all 19 models tested, degradation averaged 50%, meaning that after a series of edits, half of the original content was either missing or altered incorrectly. The researchers emphasize that these errors are "sparse but severe": they don't occur frequently, but when they do, they can be catastrophic. For example, in a legal document, a single incorrect word could change a contract term; in a medical report, a missing data point could lead to misdiagnosis. The findings highlight a critical flaw in relying on current LLMs for back-office automation without human oversight or specialized error-checking mechanisms.
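The article does not spell out how the researchers quantified degradation, but a crude proxy illustrates what a 25% or 50% content drop means in practice: measure how much of the original document survives, character for character, in the final revision. The sketch below uses Python's standard difflib and is an assumption about the general shape of such a metric, not the paper's actual methodology.

```python
import difflib

def retention(original: str, final: str) -> float:
    """Fraction of the original text still present in the final revision,
    measured by difflib's longest matching blocks."""
    matcher = difflib.SequenceMatcher(a=original, b=final, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / max(len(original), 1)

original = "Q1 revenue rose 4%. Q2 revenue rose 6%. Q3 guidance unchanged."
corrupted = "Q1 revenue rose 4%. Q3 guidance unchanged."  # one sparse, severe edit
print(f"retained: {retention(original, corrupted):.0%}")  # roughly two thirds
```

A single deleted sentence already drops retention by a third here; twenty unreviewed edits of that kind is how a document halves.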
What did experts say about the practical implications for enterprise AI?
Brian Jackson, principal research director at Info-Tech Research Group, found the benchmark valuable but cautioned against overgeneralization. He said, "Putting a list of LLMs to the test across different work domains yields a lot of useful insights… however, what we shouldn't conclude from this is that, because these foundation models caused document degradation after 20 edits, they can't be used to automate work in a certain field. It just means they can't do all of the work as they are currently constructed." Sanchit Vir Gogia, chief analyst at Greyhound Research, was more direct: "The Microsoft paper should be read as a serious warning about delegated AI, not as a claim that enterprise AI has failed. That distinction matters." Both experts agreed that the results don't doom enterprise AI, but they underscore the need for stronger automation designs and guardrails.

How can enterprises mitigate the risks of delegated AI?
Jackson recommends designing automation flows with stronger guardrails. For instance, instead of a single LLM performing all editing tasks, enterprises can deploy multiple agents that play different roles—one makes the edits, and another checks for errors and makes corrections. This multi-agent approach can catch many of the sparse but severe errors the study identified. Other mitigation strategies include implementing human-in-the-loop review for critical documents, using version control to track changes, and limiting the number of consecutive automated edits. The key is to recognize that current LLMs are not autonomous workers; they are tools that need careful oversight and structured workflows to prevent cumulative corruption. Enterprises should also consider using domain-specific fine-tuned models that may perform better on specialized tasks than general-purpose frontier models.
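As a concrete illustration of the editor/checker pattern Jackson describes, the sketch below pairs a stand-in editing agent with a stand-in reviewing agent that rejects edits deleting too much of the prior revision, and keeps the untouched document when the checker repeatedly objects. Both agent functions are hypothetical placeholders for real LLM calls, and the 90% retention threshold is an arbitrary illustrative choice.

```python
import difflib

def editor_agent(document: str, task: str) -> str:
    """Placeholder for the editing LLM; a real system would call a model."""
    return document + f"\n[edited per: {task}]"

def checker_agent(before: str, after: str) -> bool:
    """Placeholder for the reviewing LLM. Here it guards against only one
    failure mode from the study: large silent deletions."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    kept = sum(b.size for b in matcher.get_matching_blocks())
    return kept / max(len(before), 1) >= 0.9  # arbitrary retention threshold

def guarded_edit(document: str, task: str, retries: int = 2) -> str:
    for _ in range(retries + 1):
        candidate = editor_agent(document, task)
        if checker_agent(document, candidate):
            return candidate  # checker accepted the edit
    # Every attempt was rejected: keep the original revision intact and
    # escalate to human-in-the-loop review instead of corrupting it.
    return document

print(guarded_edit("Quarterly figures: Q1 up 4%, Q2 up 6%.", "fix the typo"))
```

In production the checker would itself be an LLM or a battery of validators, and rejected edits would be queued for human review rather than silently dropped; the essential design choice is that no revision is accepted on the editor's word alone.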
Why is it important to distinguish between delegated AI failure and enterprise AI failure?
Sanchit Vir Gogia emphasized this distinction: the Microsoft paper shows that LLMs are unreliable when left to their own devices over many interactions, but that doesn't mean AI has no place in the enterprise. Enterprise AI encompasses many techniques, including supervised machine learning, rules-based systems, and human-in-the-loop processes. The failure mode highlighted in the study is specific to delegation—giving a single LLM full autonomy to edit documents. In practice, enterprises rarely do that without safeguards. The real takeaway is not that AI is failing, but that we need to design AI solutions that compensate for the weaknesses of current foundation models. By understanding these limitations, businesses can build more resilient systems that leverage AI's strengths while mitigating its risks.
What domains were included in the benchmark?
The DELEGATE-52 benchmark covered 52 professional domains, from coding and crystallography to genealogy and music sheet notation. Each domain had several work environments with real documents. This diversity matters because it tested the LLMs across a wide spectrum of tasks and writing styles. For example, a coding task might involve debugging or refactoring code, while a genealogy task could require editing family trees or historical records. The inclusion of domains like crystallography and music notation shows that the researchers deliberately challenged the models with specialized, structured data. The results suggest that no matter the domain, current LLMs are prone to the same kind of degradation when given repeated editing tasks, indicating a fundamental limitation in their ability to maintain long-term context and accuracy.