
Relying on LLMs is nearly impossible when AI vendors keep changing things

computerworld • 04 May 2026, 09:00


Over the years, enterprise IT execs have gotten frighteningly comfortable having little control or visibility over mission-critical apps, from SaaS to cloud and even cybersecurity. But generative AI (genAI) and agentic systems are taking that problem to a new extreme, with vendors able to dumb down a system IT is paying billions for without so much as a postcard. 

It’s not necessarily that AI changes are made to boost profits or revenue. Even if we accept the vendor argument that such changes are in the customer’s interest, companies still need their systems to do on Thursday what they did on Tuesday, let alone what they did when the purchase order was signed.

Alas, that is no longer the case.

Consider a recent report from Anthropic that detailed a lengthy list of changes the company made to some of its AI offerings — including one that explicitly dumbed down answers — without asking or telling customers beforehand.

The report describes various changes the Anthropic team made on its own, reconsidering them only after users noticed and complained about the drop in quality.

“On March 4, we changed Claude Code’s default reasoning effort from high to medium to reduce the very long latency — enough to make the UI appear frozen — some users were seeing in high mode. This was the wrong tradeoff. We reverted this change on April 7 after users told us they’d prefer to default to higher intelligence and opt into lower effort for simple tasks,” the April 23 Anthropic report said. “On March 26, we shipped a change to clear Claude’s older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10.”

Our bad — we’ll change it back

The fastest “Oops! Our bad. We’ll change it back” moment came last month. “On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality and was reverted on April 20,” Anthropic said. 

Beyond forcing changes on customers — not necessarily for customers — the AI vendor said the interdependence among complex genAI systems makes it harder to quickly detect performance problems, whether weaker answers or slower delivery of those answers.

“Because each change affected a different slice of traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation,” Anthropic said. When “we began investigating reports in early March, they were challenging to distinguish from normal variation in user feedback at first, and neither our internal usage nor evals initially reproduced the issues identified.”

This inability to reproduce errors, or for that matter any behavior at all, is just one of the realities of genAI tools and agents. The fact that the same model is likely to give a different answer to the identical question posed two minutes apart is exactly why reproducibility is so difficult. That’s the case with all AI vendors, but it’s not their fault, in the same way hallucinations and ignored guardrails are not their fault. It’s just how LLMs operate. You want the good? Accept the bad. Blaming genAI technology for inconsistencies is like blaming the fabled scorpion for stinging the frog that carried it across the river: it’s simply its nature.
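To make the point concrete, here is a minimal sketch using the Anthropic Python SDK; the model identifier is a guess based on the article’s naming, not a confirmed API string. The identical question, asked twice in a row, routinely comes back worded differently.

```python
# A minimal sketch of why byte-for-byte reproducibility is hard.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",  # hypothetical identifier, per the article's naming
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# Token sampling is probabilistic, and the serving stack (batching, hardware,
# background changes) adds further variation on top of it.
question = "Summarize the tradeoffs of prompt caching in two sentences."
print(ask(question) == ask(question))  # frequently False
```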

All major AI vendors are in an awkward position: when deciding what level of performance to deliver, they face what looks like a conflict of interest. That’s because the vast majority of current enterprise clients pay for token usage. That gives vendors like Anthropic, OpenAI and others a real financial incentive to make background changes that increase the number of tokens customers need to purchase. Anthropic, for its part, suggested its team was trying to reduce problems where token usage was artificially inflated.

For example, in its report, Anthropic said it “received user feedback that Claude Opus 4.6 in high effort mode would occasionally think for too long, causing the UI to appear frozen and leading to disproportionate latency and token usage for those users. In general, the longer the model thinks, the better the output. Effort levels are how Claude Code lets users set that tradeoff — more thinking versus lower latency and fewer usage limit hits. As we calibrate effort levels for our models, we take this tradeoff into account in order to pick points along the test-time-compute curve that give people the best range of options.”

Technology often backfires

Sometimes, an effort to help customers backfires because, well, technology hates all of us.

The report details a March 26 incident in which an internal Anthropic change “was meant to be an efficiency improvement. We use prompt caching to make back-to-back API calls cheaper and faster for users. Claude writes the input tokens to the cache when it makes an API request, then after a period of inactivity the prompt is evicted from cache, making room for other prompts. Cache utilization is something we manage carefully.”
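In the Anthropic API, prompt caching looks roughly like the sketch below; the model name is hypothetical and LONG_PROJECT_CONTEXT stands in for a large, stable prompt prefix. Mark the prefix as cacheable, and back-to-back requests that reuse it read it from cache until it is evicted.

```python
import anthropic

client = anthropic.Anthropic()

LONG_PROJECT_CONTEXT = "…large, stable instructions plus code context…"  # placeholder

# Requests that repeat this exact prefix read it from cache (cheaper, faster);
# after a period of inactivity the entry is evicted, and the next request
# pays full price to write it again.
response = client.messages.create(
    model="claude-opus-4-6",  # hypothetical identifier, per the article's naming
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_PROJECT_CONTEXT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Resume the refactor where we left off."}],
)
print(response.usage)  # reports cache read and cache write token counts
```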

Then things got sticky. “The design should have been simple: if a session has been idle for more than an hour, we could reduce users’ cost of resuming that session by clearing old thinking sections. Since the request would be a cache miss anyway, we could prune unnecessary messages from the request to reduce the number of uncached tokens sent to the API.”

Turns out, “the implementation had a bug. Instead of clearing thinking history once, it cleared it on every turn for the rest of the session. After a session crossed the idle threshold once, each request for the rest of that process told the API to keep only the most recent block of reasoning and discard everything before it. This compounded: if you sent a follow-up message while Claude was in the middle of a tool use, that started a new turn under the broken flag, so even the reasoning from the current turn was dropped. Claude would continue executing, but increasingly without memory of why it had chosen to do what it was doing. This surfaced as the forgetfulness, repetition, and odd tool choices people reported. …We believe this is what drove the separate reports of usage limits draining faster than expected.”
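Anthropic hasn’t published the code, but the failure mode is easy to picture. Here is a purely illustrative Python sketch, with every name invented, of how a “prune old thinking once, on resume” design degrades into “prune on every turn” when a flag is never reset:

```python
import time

IDLE_THRESHOLD_S = 3600  # one hour, per the report

def drop_old_thinking(messages):
    """Keep only the most recent 'thinking' block; pass everything else through."""
    thinking = [m for m in messages if m.get("type") == "thinking"]
    newest = thinking[-1] if thinking else None
    return [m for m in messages if m.get("type") != "thinking" or m is newest]

class Session:
    def __init__(self):
        self.messages = []           # full history, including thinking blocks
        self.last_active = time.time()
        self.prune_thinking = False  # hypothetical flag inferred from the report

    def build_request(self):
        # Resuming after a long idle gap is a cache miss anyway, so pruning
        # old thinking blocks here costs nothing and shrinks the request.
        if time.time() - self.last_active > IDLE_THRESHOLD_S:
            self.prune_thinking = True

        if self.prune_thinking:
            request = drop_old_thinking(self.messages)
            self.prune_thinking = False  # the fix: prune exactly once, on resume
            # The bug was effectively a missing reset like the line above: the
            # flag stayed set, every later turn discarded earlier reasoning,
            # and the model forgot why it was doing what it was doing.
        else:
            request = list(self.messages)

        self.last_active = time.time()
        return request
```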

And Claude Opus 4.7, the vendor noted, “has a notable behavioral quirk” of being “quite verbose. This makes it smarter on hard problems, but it also produces more output tokens.”
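That quirk flows straight through to the invoice. A back-of-the-envelope sketch makes the incentive problem tangible; the price and volume below are invented for illustration, not any vendor’s actual rates.

```python
# Illustrative numbers only -- not any vendor's actual pricing or volume.
price_per_million_output_tokens = 15.00  # dollars, hypothetical
monthly_output_tokens = 2_000_000_000    # a mid-size enterprise workload

baseline = monthly_output_tokens / 1e6 * price_per_million_output_tokens
after_verbosity_change = baseline * 1.20  # a model that got 20% chattier

print(f"${baseline:,.0f} -> ${after_verbosity_change:,.0f} per month")
# $30,000 -> $36,000: a background change nobody approved becomes a 20% price hike.
```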

To be clear, I’m not suggesting Anthropic was doing anything especially poorly. Indeed, these are the kinds of problems all genAI companies face, and I applaud Anthropic’s transparency in publishing its reasoning openly. (Anthropic executives do seem to be trying to portray themselves as more ethical and responsible than many of their rivals.)

What the report makes clear, however, is that the AI package your company is spending a lot of money on is entirely within the vendors’ control. They can dumb down answers and even charge you more money by increasing token usage.

They don’t ask your team beforehand for permission to make these kinds of changes. They don’t even routinely disclose the changes after the fact. In many ways, it’s just like a cloud provider changing settings without your knowledge. Your team might have spent two days getting all of the settings just right for operations, security and compliance on Monday afternoon. You wouldn’t want that cloud team to change everything on Tuesday and not mention it. It’s the same story with SaaS.

Now more than ever, trust, honesty and integrity need to be critical vendor differentiators. That’s especially true for AI companies. You need to track accuracy, speed and a dozen other AI variables internally so you can detect any changes as quickly as possible. As boards push harder for IT to deliver clean ROI on AI efforts, these monitoring efforts are no longer optional.
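What does that tracking look like in practice? Here is a minimal sketch of an in-house canary harness in Python; the `ask` callable, the prompts and the thresholds are placeholders for whatever model client and baselines your team actually uses. Replay a fixed prompt suite on a schedule, record latency and a mechanical correctness check, and alert when either drifts.

```python
import statistics
import time

# A fixed "canary" suite: prompts whose answers can be checked mechanically.
CANARY_PROMPTS = [
    ("What is 17 * 23? Answer with the number only.", "391"),
    ("Name the capital of France in one word.", "Paris"),
]

LATENCY_BASELINE_S = 2.0  # placeholder: derive this from your own history
LATENCY_TOLERANCE = 1.5   # alert if median latency grows by 50%

def run_canaries(ask):
    """`ask(prompt) -> str` is your model client, injected to stay vendor-neutral."""
    latencies, failures = [], []
    for prompt, expected in CANARY_PROMPTS:
        start = time.monotonic()
        answer = ask(prompt)
        latencies.append(time.monotonic() - start)
        if expected.lower() not in answer.lower():
            failures.append((prompt, answer))

    median_latency = statistics.median(latencies)
    if median_latency > LATENCY_BASELINE_S * LATENCY_TOLERANCE:
        print(f"ALERT: median latency {median_latency:.2f}s vs {LATENCY_BASELINE_S:.2f}s baseline")
    for prompt, answer in failures:
        print(f"ALERT: regression on {prompt!r}: got {answer!r}")
    return median_latency, failures
```

Run it on a schedule and chart the results over time; the point is a baseline you control, independent of whatever the vendor’s own evals say.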

Buyer beware indeed.
