If you’ve spent any time on LinkedIn or in a business magazine over the past year, you’ve been told that AI is about to do your job. Maybe it’s already replacing the person next to you. Maybe you’re next. The pitch has gotten so loud that even people who know better are starting to believe it. Here’s the thing: the research coming out of the actual AI labs — the ones building these tools — tells a very different story. Two studies published recently, one from Microsoft and one from MIT, both arrived at roughly the same conclusion from different directions. AI in its current form cannot do knowledge work reliably without someone who knows the subject matter checking its work. That second part is the important one, and it’s where you come in.
Ask an AI What Happened at Hershey Chocolate in 1932
Try this the next time someone tells you AI is going to replace human researchers. Open ChatGPT and ask it what happened at Hershey Chocolate in 1932. What comes back will often be a detailed, confident story. It might tell you that 1932 was the year Hershey released their famous chocolate syrup for commercial use. It will sound authoritative. It will read like something a research assistant might hand you.

It’s also wrong. According to the Hershey Community Archives, Hershey began producing chocolate syrup for commercial use in 1926 — not 1932. The AI took a real fact, moved it six years, and wrapped it in enough plausible detail that someone who didn’t already know the answer would have no reason to doubt it. And if you didn’t already know, how would you catch it? You’d have to know enough about Hershey’s corporate history to spot the date error. You’d have to be the subject matter expert.
This isn’t a weird edge case. It’s the default behavior of these systems. When they don’t know something, they don’t tell you they don’t know. They invent an answer that looks exactly like a correct one. There’s no warning label on the fabricated parts. They’re served with the same confidence as the parts that are right.
This Is Why the Expert Problem Hasn’t Been Solved
The people getting real work done with AI right now are the ones who already know the answer. A lawyer using AI to draft a motion can tell you instantly whether the cited case law is real. A bookkeeper using AI to categorize transactions can catch when a Cost of Goods Sold entry should have gone somewhere else. An HR manager using AI to draft a policy document can tell when the referenced employment standard doesn’t apply in their province.
Take the expert out of the loop, and the system fails quietly. That “quietly” is what makes this dangerous in a business setting. The AI doesn’t stamp a big red WRONG across the parts it’s made up. The output looks identical whether the model actually knew the answer or invented one.
This is the core reason every serious attempt to replace a knowledge worker with AI has run into the same wall. The AI can produce output all day long. Someone still has to verify it. And the person who verifies it needs to know the subject well enough to catch what the AI got wrong. That person — the one with the domain knowledge — is still the load-bearing part of the system.
Microsoft Tested This Systematically and the Results Should Make Everyone Pause
This April, Microsoft Research published a paper that set out to measure exactly how bad the problem is. They built a test called DELEGATE-52, which simulates the kind of delegated work that would actually let an employer reduce headcount: a twenty-round back-and-forth editing workflow in each of fifty-two different professional areas. Accounting ledgers. Legal documents. Engineering drawings. Music notation. Recipes. Contract terms. The test was designed so that if the AI did its job perfectly, each document would end up exactly where it started. Any drift represents damage the model introduced and didn’t notice.
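To make the mechanics concrete, here is a toy sketch of that round-trip idea in Python. Everything in it is invented for illustration: `model_edit` is a hypothetical stand-in for a call to a model, and `difflib` is just a simple way to score how much of the original text survives. The paper's actual harness and scoring are more involved than this.

```python
# Toy sketch of a round-trip editing test. Apply an edit, then ask the
# "model" to revert it. A perfect editor hands back the original
# document, so any residual difference is damage the model introduced
# and didn't notice.
import difflib

def drift(original: str, final: str) -> float:
    """Fraction of the document that no longer matches the original."""
    return 1.0 - difflib.SequenceMatcher(None, original, final).ratio()

def round_trip_drift(document: str, model_edit, rounds: int = 20) -> float:
    current = document
    for i in range(rounds):
        current = model_edit(current, f"make change {i}")   # edit pass
        current = model_edit(current, f"undo change {i}")   # revert pass
    return drift(document, current)

# A perfectly faithful editor produces zero drift:
assert round_trip_drift("ledger v1", lambda doc, instruction: doc) == 0.0
```

The appeal of a design like this is that the correct final answer is known in advance, so nobody has to judge the model's work subjectively. Any drift at all is, by construction, an error.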
Nineteen models were tested, including the biggest and most expensive ones from every major AI lab. The three best performers — Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 — corrupted an average of 25% of document content by the end of the workflow. The rest of the field averaged 50%. Out of fifty-two professional areas, the top-performing model was rated “ready for delegation” in only eleven.
That includes the very top of the line. The models currently being marketed as the ones that will replace professionals cannot, in a systematic test, reliably do professional work.
The Failure Mode Is Quiet, Then Catastrophic
The Microsoft team also looked at how the errors appear. If AI degraded its output gradually, you could catch problems early. That’s not what happens. The models handle most rounds of work just fine, and then drop ten to thirty percentage points of accuracy in a single catastrophic round. Roughly 80% of the total damage came from these rare, severe failures rather than from steady decline. The AI looks fine, looks fine, looks fine — and then it isn’t.
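A back-of-the-envelope example shows why that concentration is so hard to catch. The numbers below are invented for illustration, not taken from the paper: nineteen quiet rounds that each lose half an accuracy point, and one catastrophic round that loses twenty.

```python
# Invented numbers for illustration: accuracy points lost in each of
# twenty rounds. Nineteen quiet rounds, one catastrophic one.
losses = [0.5] * 19 + [20.0]

total = sum(losses)                # 29.5 points lost overall
worst_share = max(losses) / total  # ~0.68

print(f"Worst round accounts for {worst_share:.0%} of total damage")
```

Averaged out, that workflow loses under a point and a half per round, which looks manageable on a dashboard. Almost all of the damage still lands in a single round, with no warning beforehand.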
This matches the pattern of the Hershey hallucination. Ask an AI about something it actually knows, and the answer will usually be correct. Ask it about something it doesn’t know, or something on the edge of its knowledge, and it will invent an answer that looks exactly like a correct one. The errors are unflagged. They’re invisible. They look like the rest of the output.
MIT Found the Same Thing from a Different Angle
A few months before the Microsoft paper, MIT’s NANDA initiative published a study called The GenAI Divide, based on leadership interviews, employee surveys, and analysis of 300 enterprise AI rollouts. The headline finding: 95% of enterprise AI pilots delivered zero measurable return on investment. Companies had collectively spent somewhere between $30 and $40 billion on these projects, and most of them couldn’t point to any impact on revenue or profit six months later.
The MIT researchers were careful about why. It wasn’t that the AI models were broken. It was that the tools didn’t integrate with how actual work gets done. They didn’t know the specific rules of a given industry. They didn’t catch their own mistakes. They needed human experts in the loop to produce reliable output, and when companies tried to remove those experts, the value disappeared. The study’s lead author called it a “learning gap” between the tools and the organizations using them, which is a polite way of saying that AI doesn’t know what it doesn’t know, and neither do the people buying it.
What This Means for Your Job
If you work in a role that requires knowing things — how a contract should be structured, what a clean set of books looks like, when a technical drawing doesn’t add up, why a particular customer needs a particular approach — the research is pretty clear that your job isn’t going anywhere soon. Not because AI isn’t getting better. It is. But “better than it was” is still a long way from “good enough to run without supervision.” The jobs at real risk right now are the ones where nobody was checking the work to begin with, and those jobs were on borrowed time before AI showed up.
What AI is actually doing, to the people using it well, is acting as a fast first-drafter. The human still provides the expertise. The human still catches the errors. The human is still the one who knows whether the AI made up a Hershey fact or got it right. That role isn’t being eliminated. It’s being made a bit more efficient, which is a very different thing.
The Honest Take
The boring truth is that the research from the AI labs is more cautious about AI than the marketing from the AI vendors. Microsoft, which sells AI. Anthropic, which makes Claude. Google, which makes Gemini. Their own research teams are publishing papers that say, in effect, these systems can’t be trusted to run on their own. Meanwhile, the conference speakers telling you AI is about to replace you generally aren’t the people building the models. They’re the people selling the idea.
If you’re feeling pressured by the hype, consider the source. Then consider what the actual researchers are publishing. They’re telling you that their systems still need experts in the loop, that they fabricate facts confidently, and that even their best models can’t be trusted with unsupervised work in most professional areas. When the people building the technology are telling you your job is safe, it’s probably worth believing them.

