What Is an AI Harness? The Secret Behind 100x Productivity · Issue #83

Opening

On March 31, something hard to believe happened. Anthropic accidentally published the entire source code of its AI coding tool, Claude Code, to npm. 512,000 lines, 1,906 TypeScript files. Within hours of being discovered, it had been backed up to GitHub and forked more than 41,500 times, and it is still floating around the internet today.

Anthropic explained it away as “a packaging mistake,” but the developer community was buzzing for a different reason. The leaked code proved the long-rumored ‘secret behind the AI agent productivity gap.’ (I’ve written a column for ZDNet on this before.)

Garry Tan, CEO of Y Combinator, Silicon Valley’s largest startup accelerator, read the leaked 512,000 lines himself and wrote this:

“The secret is not the model. It is the thing wrapped around the model.”

Today’s issue is about that ‘thing’. Why, on the very same Claude, does one person top out at 2x productivity while another hits 100x? To be honest, this digging started from skepticism. I deeply distrust any “Nx productivity” claim. But this one largely holds up, and it bears directly on how companies will have to rethink their AI investment strategies.

Same Model, 100x Gap: What Makes the Difference

Steve Yegge, an engineer with 40 years of experience, dropped a provocative number in a recent interview. An engineer who truly knows how to drive AI coding agents is 10 to 100 times more productive than an engineer using an ordinary chatbot — and roughly 1,000 times more productive than a Google employee in 2005.

Practitioners have long suspected these numbers are exaggerated. But Yegge nailed down one thing clearly: “the 100x-productive person and the 2x-productive person are using the exact same model.” Both run Claude Opus 4.6, and both hold the same API key.

The difference comes not from ‘intelligence’ but from ‘structure.’ And that structure is simple enough to fit on a single index card.

The Claude Code leak backed this claim with concrete evidence. What the developers who analyzed the 512,000 lines of source code found was not ‘a smarter model’ but ‘a more cleverly designed wrapper’. The leaked code contained 44 hidden feature flags, a three-tier ‘Self-Healing Memory’ system, and an autonomous agent daemon mode named ‘KAIROS’. The model itself was nowhere in the code. In its place stood “a sophisticated architecture that delivers the right context to the model, at the right time, without noise.”

The industry calls this a harness¹ — named after the tack you strap onto a horse to steer it.

‘Thin Harness, Fat Skills’: The Architecture Inverted

The framework Garry Tan laid out is intuitive from the name alone. “Thin harness, fat skills.” The idea is the exact opposite of how the industry has been operating.

The conventional ‘thick harness’ approach goes like this: cram 40-plus tool definitions into the system prompt, wait 2 to 5 seconds for every MCP² server call, and wrap each individual REST API endpoint as a separate tool. The result? 3x the tokens, 3x the latency, 3x the failure rate. Half the model’s context window gets eaten by tool descriptions, leaving no room to actually solve the problem.

‘Thin harness, fat skills,’ by contrast, is structured in three layers: skills on top, a thin harness in the middle, and deterministic tools at the bottom.

A Skill³ is a reusable procedure document written in markdown. It’s a document that teaches the model not ‘what to do’ but ‘how to do it’. The example Tan gives is striking. There is a single skill called /investigate. It consists of 7 steps and takes three parameters: TARGET, QUESTION, DATASET.

Feed it one safety researcher and 2.1 million emails → it becomes a medical research analyst
Feed it shell companies and FEC campaign-finance filings → it becomes a forensic investigator

Same markdown file, same 7 steps. Only the inputs changed. Tan described this as “software design that uses markdown as the programming language and human judgment as the runtime.” Not prompt engineering, in other words.
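To make the ‘same file, different inputs’ idea concrete, here is a minimal Python sketch. Only the skill name and the three parameter names come from the description above. The step wording, the example questions, and the helper render_skill are hypothetical stand-ins, not Tan’s actual file, and the real skill has 7 steps rather than the 3 shown here.

```python
from string import Template

# Hypothetical skill body. The wording of these steps is invented for
# illustration; only the parameter names TARGET, QUESTION, DATASET are
# taken from the article.
INVESTIGATE_SKILL = Template(
    "# /investigate\n"
    "1. Restate the question about $target: $question\n"
    "2. Survey $dataset and list the relevant sources.\n"
    "3. Report findings, citing evidence for each claim.\n"
)


def render_skill(target: str, question: str, dataset: str) -> str:
    """Fill the same procedure document with different inputs."""
    return INVESTIGATE_SKILL.substitute(
        target=target, question=question, dataset=dataset
    )


# Same procedure, two very different jobs (the questions are made up).
medical = render_skill(
    "a safety researcher", "What did they flag?", "2.1 million emails"
)
forensic = render_skill(
    "shell companies", "Who funds them?", "FEC campaign-finance filings"
)
```

The procedure text never changes between the two calls. All of the specialization arrives through the three arguments.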

The Harness is the thin layer that drives the LLM. By Tan’s yardstick, about 200 lines. It does only four things: it runs the model in a loop, reads and writes files, manages context, and enforces guardrails. That’s all.
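As a rough illustration of how little those four duties require, here is a minimal Python sketch. It is not Claude Code’s implementation. The action format, the limits MAX_STEPS and MAX_CONTEXT_ITEMS, and the scripted stand-in for the model are all assumptions made for the example. A real harness would call an LLM API where the stand-in is used.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Assumption: a "model" is any callable that maps the current context
# (a list of strings) to an action dict such as
# {"type": "write", "path": "notes.txt", "text": "hello"}.
Model = Callable[[list[str]], dict]

MAX_STEPS = 10           # hypothetical loop limit
MAX_CONTEXT_ITEMS = 20   # hypothetical context budget


def run_harness(model: Model, task: str, workdir: Path) -> list[str]:
    context = [task]
    root = workdir.resolve()
    for _ in range(MAX_STEPS):                     # 1. run the model in a loop
        action = model(context)
        if action["type"] == "done":
            break
        path = (root / action["path"]).resolve()
        if not path.is_relative_to(root):          # 4. guardrail: stay in workdir
            context.append(f"refused: {action['path']} is outside the workspace")
        elif action["type"] == "write":            # 2. write files
            path.write_text(action["text"])
            context.append(f"wrote {action['path']}")
        elif action["type"] == "read":             # 2. read files
            context.append(path.read_text())
        # 3. manage context: keep the task plus the most recent items
        context = context[:1] + context[1:][-MAX_CONTEXT_ITEMS:]
    return context


# Usage with a scripted stand-in for the model.
script = iter([
    {"type": "write", "path": "notes.txt", "text": "hello"},
    {"type": "read", "path": "notes.txt"},
    {"type": "read", "path": "../secret.txt"},
    {"type": "done"},
])
with tempfile.TemporaryDirectory() as tmp:
    transcript = run_harness(lambda ctx: next(script), "demo task", Path(tmp))
```

Everything domain-specific is absent on purpose. In this design the knowledge lives in skills and the exact computation lives in tools, so the loop itself has little reason to grow.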

The final layer is deterministic tools. Things like SQL queries, compiled code, and arithmetic — anything where “the same input must always produce the same output.” There’s a line Tan emphasized.

“An LLM can seat eight people around a dinner table, accounting for personalities and social dynamics. But ask it to seat 800, and it hallucinates a seating chart that looks plausible but is completely wrong.”

Combinatorial optimization is a deterministic problem. Force it into latent space⁴ and it fails. Conversely, a judgment like “these two founders are in the same AI infrastructure space but aren’t competitors — one does cost attribution, the other does orchestration” is something embedding similarity search will never catch. “Which side you assign each task to” is the heart of system design.
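The dinner-table example can be sketched in Python to show what the deterministic side buys you. This is a deliberately simple rule, not a real seating optimizer and not anything Tan published: it assumes each guest carries a single group label (for example a company) and only tries to spread group members across tables. The function name seat_guests and the guest data are hypothetical.

```python
from collections import defaultdict


def seat_guests(
    guests: list[tuple[str, str]], table_size: int
) -> dict[int, list[str]]:
    """Deal guests across tables like cards, sorted by group, so that
    members of one group land on as many different tables as possible.

    guests is a list of (name, group) pairs. The same input always
    produces the same seating, regardless of input order.
    """
    n_tables = -(-len(guests) // table_size)  # ceiling division
    tables: dict[int, list[str]] = defaultdict(list)
    ordered = sorted(guests, key=lambda g: (g[1], g[0]))
    for i, (name, _group) in enumerate(ordered):
        tables[i % n_tables].append(name)
    return dict(tables)


# 800 made-up guests from 100 made-up companies, 8 seats per table.
guests = [(f"guest{i:03d}", f"company{i // 8:02d}") for i in range(800)]
seating = seat_guests(guests, table_size=8)
```

The point is less the rule than the guarantee: every guest is seated exactly once, and the result can be checked mechanically. Judgments about who would enjoy sitting together would still belong to the model, which could then hand its preferences to code like this as input.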

To sum it up: intelligence goes up top (skills), execution goes down below (deterministic tools), and the harness stays thin. The most powerful property of this structure is this: every time the model gets upgraded, every skill automatically improves, while the deterministic layer at the bottom keeps running unchanged and stable.

Why This Matters Now: Anthropic’s Play to Open Up ‘Skills’

The leak was shocking not simply because source code went public. It’s because Anthropic was already moving to make this architecture the industry standard.

On October 16, 2025, Anthropic unveiled a feature called ‘Agent Skills’. Two months later, on December 18, it released the format as an open standard. It’s the exact same playbook that made MCP (Model Context Protocol) an industry standard. With nothing more than a single markdown file called SKILL.md and some YAML metadata, you can inject domain expertise into an AI agent. Microsoft, OpenAI, Cursor, GitHub, Atlassian, and Figma have already adopted the standard.

The crucial design principle here is Progressive Disclosure⁵. At the start, only the skill’s name and description are loaded into the system prompt (50–100 tokens). When the model decides “I need this skill,” it reads the full SKILL.md at that point, and any auxiliary files referenced inside are loaded only when needed in turn. It treats the context window “like a library: browse by the index, but pull books off the shelf only when needed.”
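The loading order described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Anthropic’s loader: it assumes one folder per skill containing a SKILL.md, and it reads the metadata with a simplified ‘key: value’ parser instead of a real YAML library. The names read_front_matter and SkillLibrary are hypothetical.

```python
from pathlib import Path


def read_front_matter(skill_file: Path) -> dict[str, str]:
    """Parse only the 'key: value' lines between the leading '---' markers.

    Simplified on purpose: a real loader would use a YAML parser.
    """
    meta: dict[str, str] = {}
    lines = skill_file.read_text().splitlines()
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


class SkillLibrary:
    def __init__(self, root: Path):
        # Up front, keep only each skill's name, description, and location.
        self.index: dict[str, tuple[str, Path]] = {}
        for path in sorted(root.glob("*/SKILL.md")):
            meta = read_front_matter(path)
            if "name" in meta:
                self.index[meta["name"]] = (meta.get("description", ""), path)

    def system_prompt_index(self) -> str:
        """The short catalogue that goes into the system prompt."""
        return "\n".join(
            f"- {name}: {desc}" for name, (desc, _) in self.index.items()
        )

    def load(self, name: str) -> str:
        """Read the full SKILL.md only when the model asks for it."""
        return self.index[name][1].read_text()
```

The catalogue grows by one short line per skill, while the body of a skill costs context only in the conversations that actually use it.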

One episode Tan shared captures the core of this principle. He says he grew his Claude Code configuration file, CLAUDE.md, to 20,000 lines — trying to capture every pattern and lesson he’d ever learned. The result? The model’s attention dropped off a cliff. Claude Code itself told him, “trim this down.” The fix was a ‘pointer document’ compressed to about 200 lines. The 20,000 lines of knowledge stayed put; a resolver now pulls them in only when needed.
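A pointer document plus resolver might look like the following Python sketch. The ‘keyword -> path’ line format, the file paths, and the keyword-matching rule are all assumptions made for illustration; the article does not describe how Tan’s resolver actually decides what to load.

```python
# Hypothetical pointer document: short, always in context.
POINTER_DOC = """\
testing -> docs/testing.md
migrations -> docs/db-migrations.md
release -> docs/release-checklist.md
"""


def parse_pointers(pointer_doc: str) -> dict[str, str]:
    """Each line maps a keyword to the file holding the detailed notes."""
    pointers: dict[str, str] = {}
    for line in pointer_doc.splitlines():
        if "->" in line:
            keyword, path = line.split("->", 1)
            pointers[keyword.strip().lower()] = path.strip()
    return pointers


def resolve(task: str, pointers: dict[str, str]) -> list[str]:
    """Return only the files whose keyword appears in the task text."""
    text = task.lower()
    return [path for keyword, path in pointers.items() if keyword in text]


pointers = parse_pointers(POINTER_DOC)
needed = resolve("Write migrations for the new orders table", pointers)
```

Here only the migrations notes would be pulled into context. The other files stay on disk until a task mentions them.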

This pattern resembles what the hardware industry calls ‘layered cache’ design — keeping L1, L2, and L3 caches so that frequently used data sits close and rarely used data sits farther away. AI agents are now getting a ‘memory hierarchy of knowledge’ designed on the same principle.

Oswarld’s Take

The real explosive power of this framework is that it upends how companies structure their knowledge assets. For decades, companies have managed knowledge assets in two ways: as documents (Confluence, Notion, and the like) and as code (ERP, CRM, internal tools, and so on). Both had limits. Documents come alive only when a human reads them; code lacks flexibility and carries heavy maintenance costs. The ambiguous zone in between (say, ‘the sales team’s quoting process,’ ‘the legal team’s contract review criteria,’ ‘the marketing team’s brand guidelines’) has always lived only inside people’s heads.

Skill files are a new way to turn this ‘ambiguous zone’ into an asset. Put the procedure, the judgment criteria, and examples into a single page of markdown, and it becomes a reusable organizational capability. That’s why SaaS companies like Canva, Stripe, Notion, and Zapier have already started publishing SKILL.md files describing how to operate their services. This is different in kind from API documentation. An API tells you ‘how to call it’; a skill tells you ‘when to call it, in what order, and through what judgment.’

What I find most striking in this shift is Tan’s phrase: “skills are permanent upgrades.” Conventional software accumulates tech debt⁶ the more you use it. Skills are the opposite. A skill written well once automatically gets better every time a new model ships. The judgment part upgrades with the model; the deterministic part stays stable. “Build it once and it runs forever,” as he puts it.

There is one point I want to make soberly, though. According to a report Snyk published in February 2026, 36.82% of publicly audited skills had security flaws. A malicious skill can create risks like data exfiltration and unauthorized system access. Skills must be treated as ‘code,’ not text — version control, designated owners, regular reviews, and so on. This is the unavoidable cost enterprises face when adopting skills.
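Treating skills as code can start with something as small as a metadata check run in CI. The Python sketch below assumes each skill carries two extra metadata fields, owner and last_reviewed, and a 90-day review policy. Both fields and the policy are hypothetical conventions an organization would have to define itself, not part of the SKILL.md standard, and a check like this complements rather than replaces a real security review.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # hypothetical policy


def audit_skill(meta: dict[str, str], today: date) -> list[str]:
    """Return governance problems found in one skill's metadata.

    Expects optional 'owner' and 'last_reviewed' (ISO date) fields,
    which are assumed conventions rather than standard ones.
    """
    problems = []
    if not meta.get("owner"):
        problems.append("no designated owner")
    reviewed = meta.get("last_reviewed")
    if not reviewed:
        problems.append("never reviewed")
    elif today - date.fromisoformat(reviewed) > REVIEW_INTERVAL:
        problems.append("review overdue")
    return problems
```

A skill that returns a non-empty list would fail the build, the same way untested code would.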

The proposition that “the new asset of the AI era is not the model but the skill library” is romantic, but for now it remains ‘a premise that only holds if well managed.’ The organizations that establish this premise first will be the beneficiaries of the next productivity gap.

Closing

Whenever I see claims of 100x productivity or explosive growth, my default is suspicion. The reason is simple: to say “Nx,” you need a baseline number — and if there isn’t one, how can we believe the growth? Going from 1 person to 3 and calling it 3x surely carries a different weight from going from 100 people to 300 and calling it 3x. This newsletter started from exactly that puzzlement: put on a harness and productivity improves 100x — really? New terms will keep appearing: prompt engineering, context engineering, harness engineering… Don’t be intimidated. Instead, get crystal clear on how the thing actually works.

First, the essence of the AI productivity gap lies in architecture, not the model. What the Claude Code leak showed was not ‘a smarter brain’ but ‘a smarter structure.’

Second, ‘thin harness, fat skills’ is a shift in software design philosophy. Intelligence goes on top as markdown, execution goes below as code, and orchestration stays thin. This principle is already an industry standard embraced by Anthropic, Microsoft, OpenAI, and Cursor.

Third, corporate competitiveness is shifting from ‘how good a model you use’ to ‘how deep a skill library you own.’ But this asset-building brings homework along with it: security and governance.

If, while reading this, the question “which side is our organization’s AI adoption strategy closer to?” crossed your mind, that question alone is today’s newsletter earning its keep. Right now, the time spent finding which of your team’s repetitive tasks “become a permanent asset once written up in a single page of markdown” is more valuable than the time spent poring over model comparison spec sheets.

If the response is good, the next issue will cover the other side of this structure — “why organizations with a skill architecture can shake incumbent giants.” A good response means lots of comments and lots of shares! Share this newsletter with your friends and encourage them to subscribe!

References & Further Reading

Primary sources

Garry Tan, “Thin Harness, Fat Skills”, gbrain GitHub repository, 2026. : The original write-up of today’s framework. The three-layer ‘fat skills, thin harness’ architecture is compressed onto a single index card.
Anthropic, “Equipping agents for the real world with Agent Skills”, Anthropic Engineering Blog, 2025.12. : A detailed explanation of the SKILL.md standard and the Progressive Disclosure design principle. If you want to build a skill yourself, start with this document.
Steve Yegge, “The AI Vampire”, Medium, 2026.02. : An analysis of what 10x/100x productivity really is, and the ‘vampire effect’ hiding behind it (unsustainable beyond 3 hours).
Gergely Orosz, “Steve Yegge on AI Agents and the Future of Software Engineering”, The Pragmatic Engineer, 2026.02. : Contains Yegge’s ‘8-stage AI adoption model’ and insight into why large companies structurally cannot absorb this productivity.

Background

VentureBeat, “Claude Code’s source code appears to have leaked: here’s what we know”, 2026.03.31. : An analysis covering the technical details of the leak and the exposed ‘Self-Healing Memory’ and ‘KAIROS’ designs.
Zscaler ThreatLabz, “Anthropic Claude Code Leak”, 2026.04. : A report laying out the leak’s timeline and its security risks (especially in combination with the Axios supply-chain attack).
Snyk, “Agent Skills Security Audit Report”, 2026.02. (as cited in news reports) : The source for the finding that 36.82% of public skills contain security flaws. Worth consulting for any organization weighing skills as assets.

The author, Kwangseob Ahn, is a professor of business administration at Sejong University and lead consultant at OBF (Oswarld Boutique Consulting Firm). At the university he teaches statistics and data analysis, including business data management and business analytics, while in the field he leads GTM strategy and AI strategy consulting, designing the interface between technology and business. He has published academic research on memory architecture for AI dialogue systems (HEMA), and runs Daily Arxiv, a project curating global AI papers every day. He completed the master’s program at Korea University’s Graduate School of Management of Technology and its KMBA. He is the author of People Who Outsource Their Thinking: Homo Brainless.

Footnotes

¹ Harness: the scaffolding that drives an LLM. It runs the model in a loop, reads and writes files, manages context, and enforces guardrails. The name comes from the tack you strap onto a horse.
² MCP (Model Context Protocol): an open standard Anthropic released in 2024 that standardizes how AI agents connect to external tools and data sources. It is complementary to Agent Skills: MCP handles the ‘connection,’ while Skills handle the ‘how-to.’
³ Skill file (SKILL.md): a reusable procedure document written in markdown. It teaches the model ‘how’ to do something, not ‘what’ to do. It consists of YAML metadata and a markdown body.
⁴ Latent space vs. deterministic: latent space is the domain where the AI’s judgment and interpretation happen; the same input can yield a different output each time. The deterministic domain is where the same input always yields the same output, like SQL queries or arithmetic. The heart of system design is deciding ‘which tasks go on which side.’
⁵ Progressive Disclosure: a design principle of loading only what’s needed, when it’s needed, rather than everything at once. For AI agents, only a skill’s name/description is loaded up front, and the actual content is pulled in when needed. A key technique for using the context window efficiently.
⁶ Tech debt: the long-term accumulated cost of code or design shortcuts taken for short-term convenience, a state in which later fixes and extensions become difficult.