One-line idea: Agent = Model + Harness. The model is the brain; the harness is the skeleton, memory, hands, immune system, and nervous system around it.
Who this is for: an exec who needs the analogy and the decision (Scenes 1-3); a tech lead who needs the framework (Scenes 4-5); an engineer who needs the patterns (Scene 6 onward).
Vendor stance: neutral. Examples reference OpenAI Codex, Anthropic Claude Code and Managed Agents, MCP, LangChain, Vercel v0, and HumanLayer where the public evidence comes from those teams.
Prefer a slide deck? ▶ Open in Present mode — single-viewport, 10 slides, keyboard-navigable.
In 2025 the conversation was prompts. In early 2026 it was context. By mid-2026 the leading teams stopped arguing about either and started shipping a different artifact — the harness.
LangChain raised a coding agent's Terminal Bench 2.0 score from 52.8% to 66.5% by tweaking the harness alone, with the same model. None of the four 2026 case studies in this deep dive got their gains from a newer model; each beat the previous version of itself.
The five harness layers: Constraint, Context, Execution, Verification, Lifecycle. Anthropic Managed Agents shipped three of the five in April 2026; the developer still owns L1 (Constraint) and L4 (Verification), and on a managed platform that's exactly where your next dollar should go.
"Anytime an agent makes a mistake, engineer it so the agent never makes that mistake again." A team that runs for six months ends up with a harness no one else can copy quickly.
Each scene below is a short markdown intro plus an interactive visual. Scroll naturally; click anything inside the visuals to drill in. An exec can stop after Scene 3, a tech lead after Scene 5, an engineer should reach Scene 9.
Scene 1 — The equation
Agent = Model + Harness. The model is the brain. The harness is everything that makes the brain useful — the skeleton, memory, hands, immune system, and nervous system wrapped around it.
“When everything is harness, nothing is. The word ‘harness’ today is where ‘networking’ was before the OSI model. Everyone agreed networks mattered, but nobody could have a precise conversation about which part was broken.”
This deep dive gives you a precise way to talk about it.
Scene 2 — Without vs With
Same model. Same brain. Two completely different outcomes. On the left, the brain drifts, loops, and bluffs because nothing keeps it bounded. On the right, the brain sits inside a vehicle — steering, brakes, dashboard, guardrails, and a review stop turn raw reasoning into something the organization can actually use.
The model isn’t the bottleneck. The missing harness is.
Scene 3 — Why now
Four independent teams in 2026, four very different setups, same conclusion: the harness moves the needle far more than the next model upgrade does. None of them got their gains from a newer frontier model; they beat the previous version of themselves by re-engineering the system around the same model.
That’s the shift. The discipline now has a name.
Scene 4 — The 5 layers
The cleanest formalization comes from Anthropic’s Managed Agent post (April 2026): a harness has five layers, each with a different purpose, owner, and rate of change. Click any layer below to see its examples and minimum viable version. Switch the metaphor (Body / Kitchen / Car) to match your audience.
In April 2026, Anthropic Managed Agents shipped L2 (Context), L3 (Execution), and L5 (Lifecycle) as platform infrastructure: durable session logs, sandboxed execution, MCP routing, crash recovery. p50 time-to-first-token dropped ~60% and p95 by more than 90%. L1 (Constraint) and L4 (Verification) stay the developer's responsibility, and on a managed platform that's exactly where your next dollar should go.
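A minimal sketch of that division of labor in code, assuming nothing about any vendor's schema (the field names and rate-of-change values below are illustrative; the ownership split mirrors the managed-platform example above):

```python
from dataclasses import dataclass

# Illustrative only: layer names come from the 5-layer model; the ownership
# split mirrors the Managed Agents example (platform owns L2/L3/L5, the
# developer owns L1/L4). Nothing here is a real vendor schema.

@dataclass(frozen=True)
class HarnessLayer:
    code: str
    name: str
    owner: str           # who is accountable for this layer
    rate_of_change: str  # how often it should be expected to evolve

LAYERS = [
    HarnessLayer("L1", "Constraint",   owner="developer", rate_of_change="per repo convention"),
    HarnessLayer("L2", "Context",      owner="platform",  rate_of_change="per session"),
    HarnessLayer("L3", "Execution",    owner="platform",  rate_of_change="per tool change"),
    HarnessLayer("L4", "Verification", owner="developer", rate_of_change="per test or check added"),
    HarnessLayer("L5", "Lifecycle",    owner="platform",  rate_of_change="per incident"),
]

# Quick audit: which layers still need in-house investment on a managed platform?
developer_owned = [layer.code for layer in LAYERS if layer.owner == "developer"]
assert developer_owned == ["L1", "L4"]
```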
Scene 5 — One governed run
What does an agent run actually look like when the harness is doing its job? Nine steps in three acts: Prepare the work, Execute inside policy, Govern the outcome. Click any step to see what the harness is doing, what evidence proves it ran, and what fails when this step is missing.
The shape of a real run, not a demo.
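A rough sketch of that shape, under the assumption that your runner exposes prepare, execute, verify, and approval hooks (all names below are hypothetical, and the evidence log stands in for "what evidence proves it ran"):

```python
import json
import time
from typing import Callable

# Hypothetical sketch of a governed run. The three phases mirror the acts above:
# prepare the work, execute inside policy, govern the outcome. Every step appends
# to an evidence log so there is an artifact proving it ran.

def governed_run(task: str,
                 prepare: Callable[[str], str],
                 execute: Callable[[str], str],
                 verify: Callable[[str], bool],
                 request_approval: Callable[[str], bool]) -> str | None:
    evidence = []

    def record(step: str, detail: str) -> None:
        evidence.append({"ts": time.time(), "step": step, "detail": detail})

    # Act 1: prepare the work
    plan = prepare(task)
    record("prepare", plan)

    # Act 2: execute inside policy
    result = execute(plan)
    record("execute", result[:200])

    # Act 3: govern the outcome
    if not verify(result):
        record("verify", "failed")
        return None
    record("verify", "passed")
    if not request_approval(result):      # a human review stop keeps the final say
        record("approval", "rejected")
        return None
    record("approval", "granted")

    print(json.dumps(evidence, indent=2))  # the run's evidence trail
    return result
```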
Scene 6 — Failure modes
When the harness is missing a layer, specific failures show up in production. Tap any failure to see the symptom you’d recognize and the layer that should have caught it. Most teams fight the symptoms; the harness fights the root cause.
If you’ve watched any of these happen in production, you’ve seen exactly which layer was empty.
Scene 7 — Where to invest
You can’t build all five layers at once. The decision tree on the left tells you what to do Monday based on whether you’re on a managed platform or self-hosting. The heatmap on the right tells you what most teams have actually built — and where the empty cells are.
Default answer for most teams: ship L4 (Verification) first. Fastest path from “demo works” to “production reliable.”
Scene 8 — Three decision lanes
The right boundary is not “AI or no AI.” It’s three lanes — Agentic for adaptive tool-using work, Deterministic for known-path rule-heavy work, and Human authority for approval, exceptions, and risk acceptance. A harness only earns its keep in lane 1.
Picking the wrong lane is how teams overspend on harnesses for work that should have been a script — or quietly hand authority to an agent that should have stayed with a human.
Scene 9 — Implementation playbook
One minimum-viable move per layer. Click any layer tab to see the goal, two or three example patterns, the anti-pattern that wastes everyone’s time, and a code/config example you can ship this sprint.
Boris Cherny on Claude Code: effective verification methods multiply final output quality. A single hook that runs your existing test command and exits non-zero on failure is the cheapest reliability win you can ship.
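A minimal sketch of that hook, assuming your agent runner can call a script after each edit; the test command is a placeholder for whatever your repo already runs, and this is not any specific product's hook API:

```python
#!/usr/bin/env python3
"""Minimal verification hook: run the project's existing test command after each
agent edit and exit non-zero on failure so the harness feeds the errors back."""
import subprocess
import sys

TEST_COMMAND = ["npm", "test", "--", "--silent"]  # placeholder: your existing test command

def main() -> int:
    result = subprocess.run(TEST_COMMAND, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure to the agent verbatim: the error text is the fix hint.
        sys.stderr.write(result.stdout[-4000:] + result.stderr[-4000:])
        return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(main())
```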
OpenAI Codex invested more in the Constraint layer (L1) than in any other. The leap was making every violation tell the agent exactly how to fix it. Wikis don't enforce rules; linters do.
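A hedged sketch of what "the violation is the fix instruction" can look like: a repo-local structural check whose failure message tells the agent exactly what to do instead (the rule, paths, and wording are illustrative, not OpenAI's tooling):

```python
#!/usr/bin/env python3
"""Illustrative structural rule: forbid direct imports of an internal `db` module
outside the data-access layer, and make the failure message the remediation."""
import pathlib
import re
import sys

ALLOWED_DIR = pathlib.Path("src/data_access")   # illustrative boundary
PATTERN = re.compile(r"^\s*(from|import)\s+db\b", re.MULTILINE)

violations = []
for path in pathlib.Path("src").rglob("*.py"):
    if ALLOWED_DIR in path.parents:
        continue
    if PATTERN.search(path.read_text(encoding="utf-8")):
        # The message is written for the agent: what is wrong AND what to do instead.
        violations.append(
            f"{path}: direct `import db` is not allowed here. "
            f"Call a repository function in src/data_access/ instead."
        )

if violations:
    print("\n".join(violations))
    sys.exit(1)
```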
An ETH Zurich study of 138 agentfiles found that LLM-generated agentfiles hurt performance and cost 20%+ more in tokens. HumanLayer's working CLAUDE.md is under 60 lines. Curated and relevant beats comprehensive and stale.
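One way to keep "curated" enforced rather than aspirational is a tiny CI check on the agentfile's size; the 60-line budget below just echoes the HumanLayer figure cited above, and the file name and limit are assumptions to adjust:

```python
#!/usr/bin/env python3
"""CI check: keep the agent instruction file curated, not comprehensive.
The 60-line budget echoes the HumanLayer figure; pick whatever budget your
team actually maintains."""
import pathlib
import sys

AGENTFILE = pathlib.Path("CLAUDE.md")   # or AGENTS.md, etc.
MAX_LINES = 60

lines = AGENTFILE.read_text(encoding="utf-8").splitlines() if AGENTFILE.exists() else []
if len(lines) > MAX_LINES:
    print(f"{AGENTFILE} is {len(lines)} lines (budget {MAX_LINES}). "
          f"Trim stale rules before adding new ones.")
    sys.exit(1)
```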
Scene 10 — Vocabulary and sources
The Principle in a sentence, 13 vocabulary cards (tap to flip), and a sources strip, all in one closing scene.
Principle: "Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again." Every other rule in this deep dive is downstream of that one.
How this relates to the older 7-component view
You may see an older OpenAI-derived framing that lists seven harness components: control loop, state management, memory, tools and sandboxing, context management, planning and self-verification, and error handling. That view isn’t wrong — it’s the same system at a different altitude. The 5-layer model is the architectural spine; the 7 components are how the work gets done inside it.
| 5-layer view (this deep dive) | 7-component view (older framing) |
|---|---|
| L1 Constraint | (Implicit — encoded in linters and structural rules, not in any single component) |
| L2 Context | Context management, memory |
| L3 Execution | Tools and sandboxing, control loop |
| L4 Verification | Planning and self-verification |
| L5 Lifecycle | State management, error handling |
If your team already speaks 7-component, keep speaking it. The 5-layer model is just easier to assign owners and rates of change to, which is why it’s becoming the dominant vocabulary.
When NOT to invest in a harness
A harness costs real engineering time. Five situations where it’s the wrong move:
- Throwing the code away in two weeks? Skip it. Prove the idea first; harness it once it has to live.
- One model call, one answer, no loop: there's no multi-step compounding failure to defend against.
- A codebase with no module boundaries can't be harnessed cheaply. Refactor (or hire humans) before you ask an agent to behave well in there.
- Don't build a deep harness on a platform you might leave in two months. Wait until the platform decision is made.
- Five layers need cross-functional ownership. If you can't staff that across architecture, dev, platform, QA, and SRE, focus on L4 only and revisit later.
Harness investment compounds; harness debt compounds faster. But you only earn the compounding if the work is real, repeated, and multi-step.
Sources
Primary references those articles draw on:
- OpenAI, Harness engineering: leveraging Codex in an agent-first world (February 11, 2026).
- Anthropic, Building effective agents (December 2024) and Building a C compiler with a team of parallel Claudes (February 2026).
- Anthropic Managed Agents platform announcement (April 2026).
- LangChain Terminal Bench 2.0 results, HumanLayer practical guides, Vercel v0 tool-scoping write-up, ETH Zurich agentfile study, Boris Cherny on Claude Code verification — find these via the three synthesis articles above.
This deep dive is vendor-neutral. The patterns work on any modern coding agent — Claude Code, Codex, Cursor, Copilot Workspace, or your own — provided you name the five layers, give each one an owner, and run the Principle for long enough to compound.