Why now: same model, different harness

Four independent teams, four setups, same conclusion.

The harness moves the needle far more than the next model upgrade. The bars below show what changed when teams re-engineered the system around the model.

LangChain · 2026
+13.7 points on Terminal Bench 2.0
Same model. Harness-only tweaks. 52.8% → 66.5% pass rate.
52.8% → 66.5% harness only
Vercel v0 · 2026
Removing 80% of available tools improved task completion
Trim the toolbelt; escape the "dumb zone" where the model burns tokens reasoning about tools.
−80% tools removed
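The trimming idea can be sketched in a few lines: expose only the tools a task actually needs instead of the full registry, so the model never has to reason about irrelevant options. All names below are illustrative assumptions, not Vercel's actual implementation.

```python
# Hypothetical tool registry: in a real harness each entry would carry a
# full JSON schema, not just a description string.
FULL_REGISTRY = {
    "read_file": "Read a file from the workspace",
    "write_file": "Write a file to the workspace",
    "run_tests": "Run the project's test suite",
    "search_web": "Search the web",
    "query_db": "Run a SQL query",
    "deploy": "Deploy to production",
}

def trim_toolbelt(registry: dict[str, str], allowlist: set[str]) -> dict[str, str]:
    """Return only the tools on the allowlist, preserving descriptions."""
    return {name: desc for name, desc in registry.items() if name in allowlist}

# A code-editing task rarely needs deployment or web search:
tools_for_edit_task = trim_toolbelt(
    FULL_REGISTRY, allowlist={"read_file", "write_file", "run_tests"}
)
```

The design choice is per-task allowlists rather than one global toolset: the smaller the menu handed to the model, the fewer tokens it burns deciding what not to use.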
Boris Cherny · Claude Code
2 to 3× output quality from verification
Effective verification methods (pre-stop hooks, golden tests) multiply final output quality.
2–3× quality multiplier
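A pre-stop hook of the kind described above can be sketched as a gate: the agent is only allowed to declare itself done once an independent check (for example, a golden-test suite) passes. The shape below is a hypothetical sketch, not Claude Code's actual hook API.

```python
from typing import Callable

def pre_stop_hook(run_verification: Callable[[], bool], max_attempts: int = 3) -> str:
    """Gate the agent's stop request on a verification callback.

    Returns "stop" once verification passes, or "escalate" if it never
    does within max_attempts. In a real harness, each failure's output
    would be fed back into the agent's context before the next attempt.
    """
    for _ in range(max_attempts):
        if run_verification():
            return "stop"
        # ...agent resumes work here, guided by the failure output...
    return "escalate"  # verification never passed; hand off to a human

# A toy verifier that fails twice, then passes:
results = iter([False, False, True])
outcome = pre_stop_hook(lambda: next(results))
```

The multiplier comes from the loop: a failed check turns into another work cycle instead of a wrong answer shipped to the user.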
OpenAI Codex · Feb 2026
~1M lines, ~1,500 PRs, 0 hand-written
Five months. Real product. Humans built the harness; agents wrote the code.
1,000,000 lines · 0 hand-written · the harness shipped it
The pattern. None of these teams beat their numbers by upgrading to the next model. They beat the previous version of themselves by re-engineering the system around the model: fewer tools, tighter verification, durable session state, hand-written context. The harness is the leverage.