Providers Just Started Metering AI by the Token. NVIDIA Just Put 128GB of Unmetered Inference on Your Desk.
Two announcements landed within a few days of each other this spring, and most coverage treated them as unrelated. One was a billing change. The other was a chip. Read together, they describe the same turning point: the industry began charging for inference by the unit at the precise moment the hardware to make a unit of inference free arrived on a laptop.
The Meter
Anthropic moved enterprise customers from per-seat subscriptions to per-token pricing with monthly spend commitments. Platform seat fees, roughly $20 per user each month for Claude Code and around $10 per user each month for the Claude.ai business tier, now cover access only; the work itself bills by consumption at API rates, and the change reportedly removed the prior 10 to 15% API volume discounts. This was not one vendor acting alone. OpenAI moved Codex from a flat rate to token metering in early April 2026, and GitHub tightened Copilot usage limits the same month, each shifting more of the cost onto consumption. It is part of a broader move toward rationing and usage-based pricing across the industry.
It is tempting to read this as vendors turning the screws, but the simpler explanation is that the commercial model grew up. Flat per-seat pricing was a user-acquisition subsidy that made sense when the goal was adoption and the marginal cost of a query was being absorbed by investors. Metering is what a mature infrastructure business looks like: you pay for what you consume, the way you pay for compute, storage, and bandwidth. The problem for buyers is not that the model is unfair. The problem is that it is consumption-elastic, and AI consumption is climbing in a way that does not track the per-unit price at all.
The Bill That Goes Up While the Price Goes Down
Here the data does something counterintuitive. Per-token prices have collapsed. The cost of LLM inference fell from roughly $60 per million tokens for GPT-3 in late 2021 to about $0.06 for Llama 3.2 3B in 2024, a thousandfold drop in three years, which a16z characterized as faster than compute cost during the PC revolution or bandwidth during the dotcom boom. By any per-unit measure, intelligence is getting cheaper at an extraordinary rate.
And yet enterprise bills are exploding. Enterprise generative-AI spend reached $37 billion in 2025, up from $11.5 billion the year before, a 3.2x jump, with coding the single largest application category at $4.0 billion. The reconciliation is straightforward once you name it: this is Jevons paradox applied to tokens. When a resource gets dramatically cheaper, you do not consume the same amount for less money; you consume far more of it. Agentic workloads are the accelerant, because an agent that plans, calls tools, retrieves context, and revises its own output consumes 5 to 30 times more tokens per task than a standard chatbot, a multiplier Gartner reported in March 2026. Falling prices induce demand faster than the prices fall, so the bill rises even as each token gets cheaper. As one CIO put it in the analyst Josh Bersin's recent write-up on rising AI prices, "I am preparing myself to be surprised by the bills."
The strategic consequence is the part that gets missed. If your spend is governed by induced demand rather than by unit price, you cannot price-shop your way to safety; a cheaper model just gets used more. The only structural escape from a consumption-elastic bill is to fix the marginal cost of a token at zero, and the only way to do that is to own the hardware the token runs on.
The Desk
That hardware just showed up. NVIDIA used its Computex keynote to announce the RTX Spark line of Windows PC chips, pairing a 20-core Grace CPU with a Blackwell RTX GPU, up to 128GB of unified memory, and roughly a petaflop of AI compute. NVIDIA states the platform runs 120-billion-parameter models locally with up to a million tokens of context, shipping this fall in laptops and desktops from ASUS, Dell, HP, Lenovo, Microsoft Surface, and MSI. Jensen Huang framed it plainly: "The PC is being reinvented... you ask, and the PC does the work."
The rhetorical hinge came from Microsoft. Satya Nadella said the goal is "to deliver unmetered intelligence to every home and every desk with Windows." That word, unmetered, is the direct answer to the meter. The pricing of the consumer line was not in the release, and I would not put weight on leaked numbers, but there is a real data point already on the table: NVIDIA's DGX Spark developer box, with the same 128GB and roughly a petaflop, sells for $3,999 today. A capable local-AI machine is a four-figure purchase, not a capital project.
The reason 128GB can run a 120-billion-parameter model deserves a sentence, because the headline sounds impossible otherwise. Two things make it work. The first is quantization: storing model weights at 8-bit precision rather than 16-bit roughly halves the memory a model needs. The second is mixture-of-experts architecture, where a model like OpenAI's open-weight gpt-oss-120B has roughly 120 billion total parameters but activates only about 5 billion of them per token. You store the whole model and pay the compute cost of a small one. That is why a single workstation now handles what required a server rack two years ago.
What the PC Revolution Actually Taught
It is easy to take this story too far, and the cleanest correction comes from someone who lived through the last version of it. Steven Sinofsky, who ran Windows at Microsoft, has been arguing that AI is shifting from the cloud toward devices. The temptation is to hear that and conclude the cloud is finished, the way people in the 1980s predicted the PC would kill the mainframe and the data center.
Sinofsky's own reading of that history is the opposite, and it is the discipline this thesis needs. The PC was supposed to "eliminate mainframe computing and the data center. HAHA. Everyone was wrong all around," he wrote. What actually happened is that PCs proliferated and data centers expanded at the same time; people "first connected their PCs to data centers, and then just replaced the hardware in data centers with PC hardware." The lesson is not substitution. The lesson is that capability moved to both ends of the wire at once, and the total amount of computing went up.
So the right word for what is coming is not migration but bifurcation. Inference splits by workload. Cost-sensitive, latency-sensitive, and privacy-sensitive tasks have a strong reason to run on-device, while frontier reasoning and all model training stay in the cloud, because that is where the capability still lives. The open-weight models you can run locally are genuinely close on the dimensions that matter most for routine agentic work. On coding benchmarks like SWE-bench Verified and Terminal-Bench, the strongest open-weight models have effectively caught the closed frontier. Where the closed models still hold an edge is the hardest reasoning and graduate-level science, where the same analysis puts the gap at roughly 3 to 8 percentage points on most benchmarks and closer to ten on GPQA Diamond. That gap is precisely the line along which the split will fall: the local machine handles the high-volume, well-scoped middle of the distribution, and the cloud keeps the hard tail.
The Hedge, and Where the Data Goes
For a team running agentic workloads at scale, this is a concrete option rather than a thought experiment. The break-even intuition is directional, and I want to be clear it is illustrative rather than a quote: a roughly $4,000 machine is a one-time cost, while a metered agent doing real work is a recurring one. The more tokens your agents burn, the faster a fixed hardware cost amortizes against an elastic bill, and agentic volumes are exactly the case where that crossover arrives quickly. You are not betting that the cloud gets more expensive. You are buying an asset whose marginal token cost is zero against a line item that grows with your own success.
The barrier to testing this used to be an infrastructure program. It is now a weekend. Ollama, which has crossed 170,000 GitHub stars and serves more than 52 million model pulls a month, runs quantized open-weight models on macOS, Linux, and Windows with a single command. The distance between curiosity and a working local model is an afternoon, which changes who gets to run the experiment from the platform team to any engineer with a capable laptop.
There is a second reason this matters for anyone responsible for data, and it has nothing to do with the bill. When inference runs on the device, the prompt and its context never leave the machine. No payload crosses a vendor boundary, which collapses an entire category of data-residency and confidentiality questions for regulated or sensitive workloads. Inference that physically runs on hardware you control is the only version of "private" that fully holds; a model reached through a vendor's virtual private cloud still sends plaintext across a trust boundary. That same on-device property is why local runtimes deserve governance attention rather than blind enthusiasm: a model an engineer pulls onto a laptop is, by default, a runtime nobody catalogued, which is the shadow-IT exposure I worked through in the self-hosted AI questionnaire post. The same blind spot shows up at the platform layer, where you often cannot say with confidence which models are actually running in your environment. Local inference is a privacy gain and an inventory problem in the same motion.
The move worth making this quarter is small and specific: pick one high-volume agentic workflow, run it against gpt-oss-120B or Llama 3.3 70B on a 128GB machine through Ollama, and measure the token volume it would have metered in the cloud. You will end that test knowing your own crossover number, and you will be holding a hardware hedge against a billing model that grows with everything your agents do.