The Price of One Token Changes Everything · Issue #69

Opening

Dear subscriber, have you heard of open models? They go by many names — local models, on-premises models, and so on — but simply put, they’re AI models you can install and run on your own computer (or server) without an internet connection.

Last week Google announced Gemma 4. Apache 2.0 license, 31B parameters, 86.4% on the agentic tool-use benchmark (τ2-bench). Considering that Gemma 3, just one generation earlier, scored 6.6% on the same test, this isn’t incremental improvement. It’s a massive leap.

Around the same time, the team behind the AI agent framework Deep Agents published its own evaluation results. Open models like GLM-5 and MiniMax M2.7 showed accuracy comparable to closed frontier models like Claude Opus 4.6 and GPT-5.4 on core agentic tasks — file manipulation, tool calling, instruction following.

But what I’m watching isn’t the benchmark scores. (I’ve said it many times: I’m a benchmark skeptic.) It’s the price tags. When the cost of doing the same work differs by a factor of 8 to 20, that’s not tech news; it’s a signal that the structure of the industry is changing.

🔢 What the Numbers Say: The Real Cost Gap

Put the Deep Agents evaluation data next to the prices and the landscape changes.

Start with agentic task accuracy (Correctness). Claude Opus 4.6 leads at 0.68, followed by Gemini 3.1 Pro at 0.65. GLM-5 is right behind at 0.64, actually higher than GPT-5.4 (0.61). MiniMax M2.7 sits at 0.57, 0.08 behind Gemini 3.1 Pro.

Now attach the prices. On output tokens, Opus 4.6 costs $25 per million tokens. GLM-5 is $3.15, MiniMax M2.7 is $1.20. For the same work, the cost differs by a factor of 8 to 20.

Convert that to production scale and it gets even sharper. Suppose you run an agent system that outputs ten million tokens a day: Opus 4.6 costs $250 a day, MiniMax M2.7 costs $12. In annual terms, that’s a gap of about $87,000 ($238 a day × 365), or roughly ₩126 million at about ₩1,450 to the dollar. A difference this size isn’t “cost savings”; it’s the difference between “can” and “cannot.”
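For readers who want to rerun the arithmetic with their own volumes, here is a minimal sketch. The prices and the ten-million-token scenario are the ones quoted above; the function and variable names are mine, and the calculation assumes output tokens only, with no input tokens, caching, or volume discounts.

```python
# Output-token prices in USD per million tokens, as quoted in this issue.
PRICE_PER_MILLION = {
    "Claude Opus 4.6": 25.00,
    "GLM-5": 3.15,
    "MiniMax M2.7": 1.20,
}


def daily_cost(model: str, output_tokens_per_day: int) -> float:
    """USD per day for a given daily output-token volume."""
    return PRICE_PER_MILLION[model] * output_tokens_per_day / 1_000_000


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model costs per output token."""
    return PRICE_PER_MILLION[expensive] / PRICE_PER_MILLION[cheap]


TOKENS_PER_DAY = 10_000_000  # the ten-million-token scenario from the text

opus = daily_cost("Claude Opus 4.6", TOKENS_PER_DAY)
minimax = daily_cost("MiniMax M2.7", TOKENS_PER_DAY)
annual_gap = (opus - minimax) * 365  # simple 365-day year, no growth assumed

print(f"Opus 4.6: ${opus:,.0f}/day, MiniMax M2.7: ${minimax:,.0f}/day")
print(f"Annual gap: ${annual_gap:,.0f}")
for cheap in ("GLM-5", "MiniMax M2.7"):
    print(f"Opus 4.6 vs {cheap}: {price_ratio('Claude Opus 4.6', cheap):.1f}x")
```

Change `TOKENS_PER_DAY` to your own daily volume and the gap scales linearly with it.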

🧩 The New Equation Gemma 4 Introduces

Gemma 4, which Google released on April 2, adds one more variable to this cost equation: where you run it.

Look at the Gemma 4 lineup and the intent is clear. E2B (2B) and E4B (4B) are for smartphones and browsers. The 26B MoE¹ activates only 3.8B parameters at inference time yet posted an LMArena text score of 1,441. The 31B Dense fits on a single consumer GPU while ranking third among open models worldwide.

What stands out most is the agent-related performance. On τ2-bench (an agentic tool-use benchmark), Gemma 3 27B scored 6.6%; Gemma 4 31B hit 86.4%. A 13x jump in a single generation. Native function calling², structured JSON output, and multi-step planning are built in, so you can assemble agent workflows without a separate framework.
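If function calling is new to you, the host-side half of the pattern looks roughly like the sketch below. This is a generic illustration, not Gemma 4’s actual output format: the JSON shape, the `get_order_status` tool, and the `dispatch` helper are all hypothetical, and no model is called. A hard-coded string stands in for the model’s structured output.

```python
import json


def get_order_status(order_id: str) -> dict:
    """Hypothetical tool: a real one would query an order system."""
    return {"order_id": order_id, "status": "shipped"}


# Registry of the tools the host application exposes to the model.
TOOLS = {"get_order_status": get_order_status}


def dispatch(model_output: str) -> dict:
    """Parse a JSON tool call emitted by a model and run the matching tool."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# Stand-in for structured output from a model (assumed shape, see above).
model_output = '{"name": "get_order_status", "arguments": {"order_id": "A-1001"}}'
result = dispatch(model_output)
print(result)
```

The point of having this built into the model is that the model produces the JSON reliably on its own; the host code that executes the call stays this small.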

And these models ship under the Apache 2.0 license. No restrictions on commercial use, no monthly active user caps. Put Gemma 4 31B on your own servers to run internal agents, and your entire cost isn’t API bills — it’s GPU electricity.

According to Artificial Analysis data, the API cost of Gemma 4 31B is $0.20 per million tokens on Lightning AI. Compared with Opus 4.6’s $25, that’s a 125x difference. There is a performance gap, of course — but for tasks where “good enough” is all you need, like an agent’s repetitive tool calls, that price difference is decisive.

🤖 The Real Problem of the Agent Era: Burning Lots of Tokens

Let’s go one step deeper here. Cheap open models are nice — but why has this become important now?

The answer lies in the token consumption structure of agent workflows.

A regular chatbot makes one LLM call per user question. But an agent is different. To handle a single user request, it loops through planning → tool selection → execution → verification → self-correction, making 10 to 20 LLM calls. According to Gartner’s March 2026 analysis, agentic models consume 5 to 30 times more tokens than regular chatbots.
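Why do tokens multiply so much faster than calls? One plausible mechanism is that every call in the loop re-reads the history built up by the earlier steps. The sketch below models only that effect. The token sizes are hypothetical, chosen to show the shape of the growth, and it ignores prompt caching and context compression, so it is not a reconstruction of Gartner’s figures.

```python
def chatbot_tokens(prompt_tokens: int, reply_tokens: int) -> int:
    """A single question-and-answer call: one prompt in, one reply out."""
    return prompt_tokens + reply_tokens


def agent_tokens(prompt_tokens: int, step_tokens: int, calls: int) -> int:
    """Total tokens for an agent loop that re-sends its growing history.

    Each call reads the original prompt plus everything produced by the
    earlier steps, then writes step_tokens of new content (a plan, a tool
    call, a tool result, or a correction).
    """
    total = 0
    history = prompt_tokens
    for _ in range(calls):
        total += history + step_tokens  # input read + output written
        history += step_tokens          # the next call sees this step too
    return total


# Hypothetical sizes, for illustration only.
PROMPT, STEP = 2_000, 200
baseline = chatbot_tokens(PROMPT, STEP)
for calls in (10, 20):
    total = agent_tokens(PROMPT, STEP, calls)
    print(f"{calls} calls: {total:,} tokens, {total / baseline:.1f}x one chat call")
```

Because the history grows with every step, doubling the number of calls more than doubles the tokens consumed.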

Hand one software engineering task to an agent and it burns 1 million to 3.5 million tokens, retries and self-correction loops included. Run that on a frontier model and a single task vaporizes $5 to $8. In a production environment processing thousands of tasks a day, that can reach hundreds of millions of won — hundreds of thousands of dollars — a month.

According to the FinOps Foundation’s 2026 report, the average enterprise AI budget jumped from $1.2 million a year in 2024 to $7 million in 2026. And yet unit token prices keep falling: between 2024 and 2026, the median token price dropped at a rate of 200x per year.

That’s the paradox. Token prices are falling, but total costs are rising, because the volume of tokens agents consume is growing faster than prices are falling. In this situation, the price advantage of open models isn’t a simple saving; it decides whether you can put agents into production at all.

🏗️ Where the Value Is Moving

Step back from this trend and a bigger picture comes into view. Open models starting to work on agentic tasks means the center of value in the AI industry is shifting — from “who builds the smarter model” to “who weaves models together better.”

Deep Agents’ approach illustrates this well. The framework is designed so you can swap models with a single line of code. It supports multi-model patterns too — plan with a frontier model, execute with an open model. It automatically adjusts its compression strategy to each model’s context window size, and it injects the model’s name and capabilities into the system prompt so the agent knows what it can do.
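To make the plan-with-a-frontier-model, execute-with-an-open-model pattern concrete, here is a minimal sketch in plain Python. It is not the Deep Agents API: the `Model` and `Harness` classes, the step names, and the token counts are all hypothetical, and no model is actually called. Only the two prices come from this issue.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_million_output: float  # output-token price quoted in this issue


@dataclass
class Harness:
    """Hypothetical harness that routes each step type to a configured model."""
    planner: Model
    executor: Model
    spend_usd: dict = field(default_factory=dict)

    def model_for(self, step: str) -> Model:
        # Planning goes to the planner; every other step goes to the executor.
        return self.planner if step == "plan" else self.executor

    def record(self, step: str, output_tokens: int) -> Model:
        """Charge a step's output tokens to whichever model handled it."""
        model = self.model_for(step)
        cost = model.usd_per_million_output * output_tokens / 1_000_000
        self.spend_usd[model.name] = self.spend_usd.get(model.name, 0.0) + cost
        return model


OPUS = Model("Claude Opus 4.6", 25.00)
MINIMAX = Model("MiniMax M2.7", 1.20)

# Swapping a model means changing one argument on this line.
harness = Harness(planner=OPUS, executor=MINIMAX)

# Hypothetical trace: one planning call followed by four execution calls.
trace = [("plan", 2_000)] + [("execute", 1_000)] * 4
for step, tokens in trace:
    harness.record(step, tokens)

print(harness.spend_usd)
```

The design choice worth noticing is that the routing rule lives in the harness, not in the model, which is why the models can be treated as interchangeable parts.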

The key here is the harness. Not the model itself, but the orchestration layer³ that wraps the model and makes it usable for real work: that’s becoming the center of value. So is any single harness some monumental asset? Not really. A harness is ultimately a heavily modded tool: JSON and Markdown stitched together, then loaded with permissions and specialized knowledge. In the end it’s a souped-up skill.md. The value sits in the layer as a whole, much as in the smartphone industry, where the iOS and Android ecosystems created more value than the semiconductors themselves. We’ll keep seeing things like this.

Deloitte’s January 2026 report framed this as “Tokenomics.” Enterprise AI costs are no longer measured in subscriptions or virtual machines; a new economic regime is opening in which they move in a variable unit called the token. In this regime, competitiveness comes not from “the capital to afford expensive models” but from “the architecture that extracts more value per token.”
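One crude way to put a number on “value per token” is to divide each model’s correctness score by its output price. The sketch below does that with the figures quoted earlier in this issue. The metric itself is my own simplification, not something taken from the Deloitte report or the Deep Agents evaluation, and it ignores input tokens, retries, and the cost of wrong answers.

```python
# (correctness score, USD per million output tokens), both as quoted in this issue.
MODELS = {
    "Claude Opus 4.6": (0.68, 25.00),
    "GLM-5": (0.64, 3.15),
    "MiniMax M2.7": (0.57, 1.20),
}


def correctness_per_dollar(score: float, usd_per_million: float) -> float:
    """Crude value metric: correctness score divided by output-token price."""
    return score / usd_per_million


# Rank models from most to least correctness bought per dollar.
ranking = sorted(
    ((name, correctness_per_dollar(*data)) for name, data in MODELS.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, value in ranking:
    print(f"{name}: {value:.3f} correctness per $ of million-token output price")
```

A metric like this rewards cheap models heavily, so it is best read alongside a minimum accuracy bar for the task at hand.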

Oswarld’s Take

First things first: if you’re an iPhone user, tap [iOS], and if you’re on Android, tap [AOS] to install Google’s newly released Google AI Edge Gallery, then download the Gemma 4 model I described above and try it yourself. It’s about 3.2GB, and in my experience the performance feels roughly like GPT-4o. It supports Korean, naturally, and it’s multimodal with image and voice recognition. What does this mean?

Until now, the structure of the AI market was clear. OpenAI, Anthropic, and Google build the models; companies buy the APIs. It resembles the early SaaS market: the side that owns the platform holds pricing power, and customers are locked in. But now open models have crossed a certain threshold, and they have been optimized on top of it. To me, the most interesting angle on this shift is go-to-market (GTM) strategy.

Once open models move past “usable” to “production-ready,” this structure wobbles. Companies gain the option of bringing core workflows onto their own infrastructure. Gemma 4 E2B running on a smartphone and the 31B fitting on a single consumer GPU mean the decentralization of AI inference has become technically feasible.

From a data perspective, one more thing I want to flag: how to read these benchmarks with care. The Deep Agents evaluation is based on 138 test cases. Gemma 4’s τ2-bench score comes from a specific scenario (Retail). These numbers don’t reflect all the complexity of real production environments. The accurate reading isn’t that open models are “equal to frontier” but that they are “competitive enough on specific tasks.”

But the direction is clear. Falling token prices, rising agent token consumption, improving open model performance. At the point where these three axes meet, the range of organizations that can ‘own’ AI is fundamentally widening.

Closing

Open models (GLM-5, MiniMax M2.7, Gemma 4) have reached a level where they can compete with closed frontier models on core agentic tasks — at costs 8x to 125x lower.
Agent workflows consume 5 to 30 times more tokens than chatbots. In this environment, differences in token price determine not “savings” but “feasibility.”
The center of value in the AI industry is moving from “those who build models” to “those who weave models well.” Models become interchangeable parts, and orchestration becomes the competitive edge.

Next time you review the cost of an AI tool, compare by the value extracted per token, not by the model’s name. The landscape will look different.

Thank you for reading today. This newsletter is for subscribers only.

Spread the word and help grow the subscriber base — it’s great motivation for my writing!

References & Further Reading

LangChain, “Open Models have crossed a threshold”, 2026.4.2. : Contains the core data behind today’s newsletter, the open model vs frontier model agent benchmark results.
Google DeepMind, “Gemma 4: Byte for byte, the most capable open models”, 2026.4.2. : The official Gemma 4 announcement blog, including benchmark figures and architecture details.
Google AI for Developers, “Gemma 4 model overview”, 2026.4.2. : Memory requirements by model size, quantization options, and deployment guides.
Oplexa, “AI Inference Cost Crisis 2026: Why Your AI Bill Is Exploding”, 2026. : Covers how inference costs came to account for 85% of enterprise AI budgets and the structure of agent token consumption.
Deloitte Insights, “AI tokens: How to navigate AI’s new spend dynamics”, 2026.1. : A report analyzing how enterprises should manage AI costs in a token-based economic regime.
Zylos Research, “AI Agent Cost Optimization: Token Economics and FinOps in Production”, 2026.2. : A practitioner-oriented look at the token consumption structure of agent workflows and model routing strategies.
Hugging Face, “Welcome Gemma 4: Frontier multimodal intelligence on device”, 2026.4. : A technical analysis of Gemma 4’s architecture and its integration into the open-source ecosystem.

The author, Kwangseob Ahn, is a professor of business administration at Sejong University and lead consultant at OBF (Oswarld Boutique Consulting Firm). At the university he teaches statistics and data analysis, including business data management and business analytics, while in the field he leads GTM strategy and AI strategy consulting, designing the interface between technology and business. He has published academic research on the memory architecture of AI conversational systems (HEMA) and runs Daily Arxiv, a project that curates global AI papers every day. He completed the master’s program at Korea University’s Graduate School of Management of Technology and its KMBA. He is the author of Homo Brainless: The People Who Outsource Their Thinking.

Footnotes

¹ MoE (Mixture of Experts): An architecture that houses multiple “expert” networks inside one AI model and activates only some of them depending on the input. The model has 26B parameters, but only 3.8B are used at actual inference time, so it runs fast like a small model while keeping the performance of a large one. Think of a buffet: the whole spread is laid out, but each guest eats only part of it.
² Native Function Calling: A built-in capability that lets an AI model directly call external tools (APIs, databases, search engines, and so on). It used to require a separate framework; with the capability built into the model itself, building agents has become far simpler.
³ Orchestration Layer: An intermediate software layer that connects and coordinates multiple AI models, tools, and data sources. Like a conductor tuning each instrument in an orchestra, it manages which model gets which task and how the results are combined. Frameworks like Deep Agents play this role.