Yes, the model you use is getting dumber
In March 2023, GPT-4 could identify prime numbers with 97.6% accuracy. By June, that figure had cratered to 2.4%. Not a rounding error, not a minor regression, but a 95-point collapse on the same task with the same prompts. If a bridge lost 95% of its load-bearing capacity in three months, someone would go to prison. In AI, the vendor posts a changelog and moves on.
This pattern has repeated with depressing regularity across every frontier provider. Models ship to applause, enterprise contracts get signed on the strength of benchmark screenshots, and then something changes. The model you evaluated is no longer the model answering your customers, and nobody tells you until your production workflow starts producing garbage.
The evidence is not anecdotal
Researchers at Stanford and UC Berkeley tracked this drift formally, comparing GPT-3.5 and GPT-4 snapshots from March and June 2023 across seven tasks. The results were bad enough to make the researchers themselves flinch. GPT-4’s ability to generate directly executable code dropped from 52% to 10%. Its willingness to follow chain-of-thought prompting, one of the most widely used techniques for improving accuracy, degraded without explanation. GPT-3.5 actually improved on some tasks where GPT-4 got worse, which implies that updates to one model’s behaviour were creating unintended regressions in another.
“The magnitude of the changes in the LLMs’ responses surprised us,” James Zou, a Stanford professor and co-author, told The Register. The team’s conclusion was blunt. The behaviour of the “same” LLM service can shift substantially in weeks, and nobody outside the provider knows when or why.
This wasn’t a one-off result that got debated and forgotten. The OpenAI developer forums have become a rolling graveyard of complaints. In September 2025, users running GPT-4.1 reported severe intelligence degradation within 30 days of launch, with complex tool calls and multi-step instructions suddenly failing. Similar threads appeared for GPT-4 Turbo in May 2025. By now the pattern is grimly predictable: works brilliantly at launch, degrades silently, users scramble to figure out what broke.
Why this happens (and why the incentives encourage it)
There are at least four mechanisms that can degrade a deployed model, and most frontier providers are using all of them simultaneously.
Quantisation is the most technically straightforward of the four, and the easiest to understand. A model trained in 16-bit or 32-bit floating-point precision gets compressed to 8-bit or 4-bit integers for serving. The arithmetic is simple: a model stored in FP16 needs two bytes per parameter, so a 70-billion-parameter model demands about 140GB of VRAM for its weights alone. Quantise to 4-bit and you cut that to around 35GB, enough to run on hardware that costs a fraction as much.
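The memory arithmetic above fits in a few lines. This is back-of-envelope only: real deployments also need memory for the KV cache, activations, and runtime overhead, none of which is counted here.

```python
# Back-of-envelope VRAM needed to hold model weights at different precisions.
# Ignores KV cache, activations, and serving overhead.

def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Gigabytes of memory needed just for the weights (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: {weight_vram_gb(70, bits):.0f} GB")
# → 140 GB, 70 GB, 35 GB
```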
The trade-off is supposed to be minimal, and Red Hat’s analysis of over 500,000 evaluations found that 8-bit and 4-bit quantised models showed “very competitive accuracy recovery” on most benchmarks, especially for larger models. But that phrase “most benchmarks” is doing heavy lifting. Quantisation works by rounding, and rounding destroys outlier values. The weights that fire rarely but matter enormously for edge-case reasoning are exactly the weights that get flattened first. For standard tasks you barely notice the difference, but for the specific hard problems your production system was built to handle, the gap can be catastrophic. One developer reported that dynamic quantisation of a 3B-parameter model dropped accuracy from 65.6% to 32.3%, a halving that no benchmark average would predict.
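A toy example makes the outlier problem concrete. Under symmetric absmax quantisation, one common scheme, the single largest weight sets the scale for everything else, so small-but-meaningful weights round away to nothing:

```python
# Toy demonstration of why quantisation hurts outlier-dependent behaviour.
# With absmax (symmetric) quantisation, the largest weight sets the scale,
# so one outlier flattens every small weight to zero.

def quantize_absmax(weights, bits=4):
    """Round each weight to the nearest representable level and dequantise."""
    levels = 2 ** (bits - 1) - 1                 # 7 levels per side at 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

weights = [0.01, -0.02, 0.015, 8.0]              # small weights plus one outlier
print(quantize_absmax(weights))                  # small weights collapse to 0.0
```

Real quantisation schemes mitigate this with per-channel scales and outlier handling, which is exactly why the damage shows up only on the edge cases those mitigations miss.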
Mixture-of-experts routing is the more interesting culprit, and the one providers talk about least. DeepSeek’s V3, for example, has 671 billion total parameters but only activates about 37 billion per token. The economics are irresistible because you get the capacity of a massive model with the inference cost of a much smaller one. But the router decides which experts handle which queries, and routing decisions are probabilistic. A query that activated your model’s strongest expert subnetwork at launch might get routed differently after an update to the routing logic, or after the provider adjusts load balancing to handle peak traffic. The user sees the same model name in the API response. The actual computation behind it may have changed entirely.
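A minimal sketch of top-k routing shows how fragile the assignment is. The logit values here are invented for illustration, not DeepSeek's actual router:

```python
# Sketch of mixture-of-experts routing: a softmax router picks the top-k
# experts per query. A tiny shift in router logits, from an update or a
# load-balancing tweak, changes which experts do the computation.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=2):
    """Return the indices of the top-k experts by router score."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

print(route([1.20, 1.19, 0.3, -0.5]))   # experts 0 and 1, in that order
print(route([1.18, 1.21, 0.3, -0.5]))   # a 0.03 shift reverses the ranking
```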
Distillation and model substitution is the elephant in the room that everyone suspects but nobody can prove definitively. Rumours have circulated since mid-2023 that OpenAI routes some queries to smaller, cheaper models behind the same API endpoint. The Gleech.org 2025 AI retrospective put it plainly: “True frontier capabilities are likely obscured by systematic cost-cutting (distillation for serving to consumers, quantisation, low reasoning-token modes, routing to cheap models).” GPT-4.5 was retired after just three months, presumably because the inference costs were unsustainable, even though it still ranked in the top five on LMArena for hallucination reduction nine months later. The model that performed best got killed because it was too expensive to run.
Safety tuning and RLHF adjustments create the subtlest form of drift. When OpenAI tightens content filters or adjusts the model’s tendency to refuse certain queries, those changes ripple through the entire behaviour space. The Stanford study found that GPT-4 became less willing to explain why it refused sensitive questions, switching from detailed explanations to terse “Sorry, I can’t answer that” responses. The model may have become safer by one measure, but it simultaneously became less transparent and less useful for legitimate applications that happened to brush against the updated boundaries.
The economics are doing exactly what you would expect
Running frontier models is staggeringly expensive, and every provider is under pressure to reduce cost-per-token. The dynamic, as one industry analysis noted, resembles building more fuel-efficient engines and then using the efficiency gains to build monster trucks. Token prices have dropped by a factor of 1,000 in three years, but reasoning models now generate thousands of internal tokens before producing a single visible output, and 99% of demand shifts to the newest model the moment it ships.
Providers respond by doing what any business would do. They optimise for throughput and margin, quantising the weights and routing easy queries to cheaper subnetworks while distilling the flagship into something that passes the benchmarks but costs a tenth as much to serve. The individual techniques are all defensible, but stacked together and applied silently, they create a system where the model’s advertised performance diverges from its delivered performance over time.
DeepSeek made this trade-off explicit and turned it into a business strategy. Its V3 model serves inference at roughly 90% below comparable OpenAI and Anthropic rates, and the MoE architecture that enables this pricing is openly documented. Whatever you think of the approach, at least the engineering trade-offs are visible. The problem is worse when providers make the same trade-offs quietly, behind an API that returns the same model identifier regardless of what actually computed the response.
What this means if you build on top of these models
The practical upshot is unpleasant but straightforward. If your application depends on consistent model behaviour, you are building on sand that shifts without warning. The Stanford researchers recommended continuous monitoring, and they were right, but monitoring alone doesn’t solve the problem, because it tells you something broke without stopping it from breaking.
Pinning to a specific model snapshot helps, where providers offer it, but even snapshots get deprecated. OpenAI maintains them for a few months and then requires developers to migrate. The careful evaluation you ran against the March snapshot becomes irrelevant when you’re forced onto the June version and nobody can tell you exactly what changed.
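If you do pin and monitor, a lightweight drift check against a frozen golden set is better than nothing. This is a sketch, not a product: `call_model` stands in for whatever provider client you use, `golden_answers.json` is a hypothetical file of prompt-answer pairs recorded at evaluation time, and exact string matching is deliberately crude (in practice you would use a tolerance or a semantic comparison).

```python
# Minimal behavioural drift check: replay a frozen golden set through the
# live model and report what fraction of answers changed since evaluation.
import json

def check_drift(call_model, golden_path="golden_answers.json"):
    """Return (changed_fraction, changed_prompts) for the golden set."""
    with open(golden_path) as f:
        golden = json.load(f)            # {prompt: answer recorded at eval time}
    changed = [prompt for prompt, expected in golden.items()
               if call_model(prompt).strip() != expected.strip()]
    return len(changed) / len(golden), changed
```

Run it on a schedule and alert when the changed fraction exceeds a threshold; that turns "the forums say it got dumber" into a number you can act on before customers do.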
The deeper issue is one of trust and transparency. When a model provider updates a live model, they are unilaterally changing the behaviour of every application built on top of it. That is not a software update but an undocumented API change, the kind that would trigger outrage in any other engineering discipline. Imagine if AWS silently swapped your database engine for a cheaper one that was “approximately equivalent” on standard benchmarks, and you can begin to see how the AI industry has somehow normalised something that would be career-ending negligence anywhere else.
Where this leaves us
The model you benchmarked, the one that earned the contract, that impressed the board, that your engineers spent weeks building prompts and evaluation harnesses around, is a snapshot of a moving target. Quantisation shaves off the edges while routing sends your queries to whichever expert subnetwork happens to be cheapest that millisecond, and safety updates redraw the boundaries of what the model will and won’t do. None of it shows up in the model name string your application receives in the API response.
Somewhere in a data centre, the accountants and the alignment researchers are both pulling the same model in different directions, one toward cheaper inference and the other toward tighter guardrails, and the engineers who built their products on last month’s version are left checking the forums to figure out why everything stopped working on a Tuesday.
I am a partner in Better than Good. We help companies make sense of technology and build lasting improvements to their operations. Talk to us today: https://betterthangood.xyz/#contact