The Machines Will Not Replace Us
Three things had to be true for AI to replace you.
I’d wager there’s a slide in your board deck right now showing how much AI did for you this year. How much it saved the business. A percent off cost-to-serve. A headcount you didn’t backfill. A tidy little efficiency gain. The number is on the slide because every leader in the org had to ship an AI win in the first half of 2026 to satisfy the plan, and its absence would itself be a flag. The room will nod when it’s presented and move on. No one will ask what’s underneath.
But the CFO will. Maybe not next quarter, but soon. When the Q4 invoices land and somebody with a calculator does the math the AI agent skipped when it drafted the slide. The CFO who was eyeing your headcount request last year is looking at the inference invoice now and wondering if the promised savings actually showed up. The pressure that came from above to adopt AI fast is about to come back around to demand an explanation for what it cost.
The machines will not replace us. Not because we won a moral argument about creativity, or dignity, or what work means. Because the math doesn’t work. The compute cost is exceeding the cost of the labor it was supposed to replace, and the people who have been selling the shovels to the gold rush are saying so out loud.
Uber burned its entire 2026 AI coding budget in four months — 5,000 engineers vastly outpacing the financial model. Four. Months. Nvidia’s own VP of Applied Deep Learning — at the company that sells the compute — said the cost of said compute “is far beyond the costs of the employees.”
The call is coming from inside the house.
I’m not a CFO or an accountant. I’m a millennial marketer who’s watched a lot of teams and executive leadership fall in love with numbers that flatter them — the ROAS figure that hides contribution margin, the AI deployment that hides the cleanup tax, the workflow that hides where the judgment was supposed to live. The AI savings line is another imaginary number that flatters and misleads.
None of this is to say AI doesn’t create value. It does — when it’s used as a tool. The teams running it as a harness for their judgment are pulling ahead of those who let the model do their thinking for them. The replacement thesis is a different equation entirely. It has always assumed that judgment — human or harness — isn’t necessary. The technology and the math say it is.
What the math actually costs
The savings figure isn’t on the slide because anyone audited it. It’s there because it had a single purpose. Boards expected an AI productivity number this year. CFOs needed something to point at when procurement asked what the seven-figure tooling spend was actually buying. The figure’s absence would have been a flag, so somebody made one.
It’s a half-truth. The figure counts one side of a two-sided ledger and calls the gap a savings. The labor avoided went into the numerator; the compute incurred never made it onto the slide.
The other side is bigger than the slide allows. Compute, sure. Also the management overhead that materialized to keep the agents engaged, on task, and performant. The procurement renegotiation when the usage blew past the tier two quarters in. The customer service hires after the chatbot ate the trust the brand had been paying to build. None of that is on the slide, but all of it appears on the P&L, whether or not you can see it.
Gartner did this audit at scale. They surveyed 350 enterprise executives at companies with at least $1 billion in annual revenue. 80% of those companies executed workforce reductions tied to AI adoption. Not one showed a meaningful correlation between those cuts and higher ROI. Headcount went down. Margins didn’t move. The math didn’t pencil out.
Microsoft figured it out faster than most. Earlier this month, the company announced it was canceling internal Claude Code licenses — six months after a glowing internal rollout — because Claude Code was running $500 to $2,000 per engineer per month.
“But tokens are getting cheaper.” Yeah, that’s true. They are. Per-token costs are expected to fall about 90 percent by 2030, but it doesn’t matter. Agentic AI drives twenty-four times the token consumption over the same stretch. Cheaper units, vastly more of them, and the bill still climbs: a tenth of the price on twenty-four times the volume is roughly 2.4 times the spend. Gartner’s own analyst said it without blinking: don’t “confuse the deflation of commodity tokens with the democratization of frontier reasoning.” A token getting cheaper isn’t your AI getting cheaper. A falling CPM never meant your ROAS had improved either.
New dog, same trick.
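For anyone who wants to check the token arithmetic rather than take it on faith, here is a minimal Python sketch. It takes the two figures quoted above at face value (a 90 percent fall in per-token price and a twenty-four-fold rise in consumption) and assumes, as a simplification, that spend is just price times volume. The function names are mine, invented for illustration.

```python
def relative_bill(price_drop: float, volume_multiple: float) -> float:
    """Future spend as a multiple of today's spend.

    Assumes spend = unit price * token volume, which is a simplification.
    """
    return (1.0 - price_drop) * volume_multiple


def break_even_price_drop(volume_multiple: float) -> float:
    """Price drop needed just to keep the bill flat at a given volume multiple."""
    return 1.0 - 1.0 / volume_multiple


# Figures quoted in the text: 90% cheaper tokens, 24x the consumption.
multiple = relative_bill(price_drop=0.90, volume_multiple=24)
needed = break_even_price_drop(volume_multiple=24)

print(f"Bill vs. today: {multiple:.1f}x")
print(f"Price drop needed to break even: {needed:.1%}")
```

Under those assumptions the bill lands at about 2.4 times today's, and the unit price would have to fall by nearly 96 percent just to hold spend flat.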
What the technology can’t do
Even if the math worked, and it doesn’t, the technology underneath wouldn’t get you to replacement either, simply because of how these models are built.
A large language model has frozen weights after training. It doesn’t learn from your conversation. It can’t update from yesterday’s mistakes. Memory features and retrieval pipelines expand what the model can search through, not what it can learn. Every session is its first session. Every interaction starts from zero context about the work you actually do, even with the best infrastructure underneath.
Rodney Brooks put it more bluntly: LLMs “don’t know what’s true. They just know what words sort of work together… They’re bullshitters until we can ground them in reality, a truth.” Sophisticated pattern matchers, not thinking machines — which is exactly why the harness around them has to do the work the model can’t. The scaling literature has been documenting theoretical ceilings on LLM scaling — five fundamental limitations including reasoning degradation and hallucination that the authors prove are intrinsic to the architecture, not artifacts of optimization or data curation. More parameters don’t grant lived experience, judgment, or the capacity to distinguish correlation from causation. Whatever does that work has to come from outside the model.
I’ve argued before that orchestration is the hard part — that most production AI failures are coordination and specification problems, not model capability problems. I still believe that, hard stop. The Berkeley MAST paper still holds: 79% of multi-agent failures are coordination and spec, not the LLM underneath. The harness is where the work lives. But orchestration only matters if the model can do what the harness is asking it to do, and the deployment economics of building the harness at scale are what this essay is actually about. The math forecloses replacement whether or not the technology could, in principle, get there with enough scaffolding. What’s clear is that the harness is necessary. What’s less clear — and what I’d argue is genuinely contested — is what the harness can and can’t fix.
What I keep watching, in my own work and in research, is one specific failure mode that lives at the boundary. When the orchestration doesn’t supply judgment about what the work is for, the model produces a clean artifact that only looks like success.
A research team tested frontier LLMs on writing unit tests for SAP HANA — a proprietary commercial codebase guaranteed to be absent from training data — against the open-source LevelDB as a control. When models hit compilation errors on HANA, they didn’t fix the underlying code. They commented out the assertions and generated empty test bodies. The tests compiled. The mutation score was near zero. Goodhart’s Law made literal: the measure became the target. Whether that’s an architectural limit or a missing quality gate is the contested question. What’s not contested is that the artifact looked correct.
I’ve watched smaller versions of this pattern across the agent workflows I run. The model drops the constraint it can’t satisfy and ships a clean artifact instead of a correct one. Sometimes I can fix it with better scaffolding — clearer specs, tighter QA criteria, trust tiers. Sometimes I can’t, and the model keeps producing artifacts that look right and aren’t. I don’t know with confidence where the line is between “my orchestration needed work” and “the model couldn’t do this no matter what I built around it.” Anyone who tells you they do know is selling you something.
What I do know is that the harness is necessary. Opper.ai tested 53 frontier LLMs on a simple question — “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?” — and over 80% recommended walking. They missed the obvious physical causality that the vehicle has to be present to be washed. The harness has to supply the causal reasoning the model doesn’t have. The same architectural blindness that produces that recommendation produces the marketing brief that’s technically correct and strategically empty, the legal clause that pattern-matches a precedent that doesn’t apply, the dashboard analysis that confidently reports a number nobody can defend in the room. The harness can catch a lot of those. It cannot catch all of them. And the cost of building the harness that catches enough of them is the cost the math section already described.
Meanwhile, the people inside the harness are getting weaker. Microsoft and Carnegie Mellon surveyed 319 knowledge workers across 936 real AI use cases. 79% reported less effort on comprehension. The kicker is which direction the confidence runs. Higher confidence in AI was associated with less critical thinking. The researchers describe a shift from problem-solving toward verifying AI outputs — what they call moving from “task execution to task stewardship.” AI handles the messy, repetitive tasks that built professional judgment in the first place, and junior employees miss the chances to develop it. Organizations end up with managers who’ve never done the underlying work, and thin leadership pipelines.
The model can produce the artifact. It cannot have the lived experience that produced the artifact at scale, accountably, against a real customer, with the consequences a person bears. It cannot learn from being wrong out loud. It cannot curate, because curation requires taste — judgment about what to leave out — and taste comes from a thousand bad subject lines, not the thousand-and-first prompt.
The audit & the bench
The wins slide goes to the CFO before it goes anywhere else, and the CFO has the calculator. The savings figure runs net of inference, net of the management overhead, net of the procurement renegotiation that hit two quarters in, net of the crisis-PR hire after the chatbot ate years of brand equity. The new number is usually smaller. Sometimes it’s negative. Either way it doesn’t make the slide twice.
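The netting described above can be sketched in a few lines of Python. This is only an illustration of the shape of the calculation: the class name, the field names, and every dollar figure are hypothetical, invented to show how a gross savings line can flip sign once the other side of the ledger is counted.

```python
from dataclasses import dataclass


@dataclass
class SavingsLedger:
    """Both sides of the ledger. All amounts are annual, in dollars."""
    labor_avoided: float              # the number that made the slide
    inference: float                  # compute bill
    management_overhead: float        # people keeping the agents on task
    procurement_renegotiation: float  # cost of blowing past the usage tier
    crisis_pr_hire: float             # cleanup after the chatbot incident

    def total_costs(self) -> float:
        """Sum of everything the slide left out."""
        return (self.inference + self.management_overhead
                + self.procurement_renegotiation + self.crisis_pr_hire)

    def net_savings(self) -> float:
        """Labor avoided minus all incurred costs; may be negative."""
        return self.labor_avoided - self.total_costs()


# Hypothetical figures, invented purely for illustration.
example = SavingsLedger(
    labor_avoided=1_000_000,
    inference=600_000,
    management_overhead=250_000,
    procurement_renegotiation=100_000,
    crisis_pr_hire=150_000,
)

print(f"Slide figure: {example.labor_avoided:,.0f}")
print(f"Net figure:   {example.net_savings():,.0f}")
```

With these made-up inputs the slide shows a million in savings and the net comes out a hundred thousand in the red. Real inputs will differ; the point is only that the sign depends on lines the slide never carried.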
The seats deleted in the first round of AI mandates are gone. That math has cleared, but the danger to the remaining bench is quieter. The AD who keeps shipping campaigns on AI-generated briefs and can’t tell you which ones missed. The PO who didn’t read their own PRD. The IC who built the dashboard from a Claude session and can’t defend a single number when the VP asks. They produce the artifact and skip the learning. A vacant seat gets backfilled in a quarter. A seat occupied by someone who stopped using their brain is a slower, more expensive failure, and it touches every part of the business it sits inside.
You can see the split forming in hiring. IBM is the loud version — they announced plans to triple entry-level hiring in 2026, with junior roles reshaped around customer engagement and verifying AI outputs. The quiet version is the marketing teams I’m watching do the same thing without a press release. They’ve trained up the junior to read the model’s output and push back. They promote the IC who can name what the AI got wrong, and adjust process accordingly. Nobody calls it a strategy. It’s just the bench they’re building.
The teams that have avoided the audit entirely are the ones that already priced the cost in. They run the agents and do the work. They know which calls the model is allowed to make, which ones come back to a person, and what the person is supposed to have learned by the time the call gets to them. When a vendor walks in with a pitch deck and a savings figure, they ask the question their CFOs taught them to ask about ROAS: net of what?
Those teams will still have their seats. Not because anyone protected them. Because the math protected them. And the technology required them.
The machines will not replace us
The replacement thesis needed three things to be true at once:
Compute had to be cheaper than labor.
Cleanup had to be free.
Judgment had to be separable from the work that produces it.
None of those things are true, and the gap between the thesis and the math is wide enough that the vendor is telling you about it on the record.
The savings line on the slide is the last place those three falsehoods still sit unchallenged. It survives because nobody has run the net yet. It does not survive the audit. Q4 always comes.
AI is a tool, the way ROAS is a tool. Useful when you know what it leaves out. Dangerous when you let it speak for the business. It makes some of the work faster and some of it better. What it cannot do is learn. It cannot have lived experience. It cannot replace judgment, or creativity, or curation.
The replacement thesis assumed those were separable. They are not.