Why now: same model, different harness

Four independent teams, four setups, same conclusion.

The harness moves the needle far more than the next model upgrade. The bars below show what changed when teams re-engineered the system around the model.

LangChain · 2026
+13.7 points on Terminal Bench 2.0
Same model. Harness-only tweaks. 52.8% → 66.5% pass rate.
52.8% → 66.5% harness only
Vercel v0 · 2026
Removing 80% of available tools improved task completion
Trim the toolbelt; escape the "dumb zone" where the model burns tokens reasoning about tools.
−80% tools removed
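The trimming idea can be sketched in a few lines: expose only the tools a task actually needs instead of the full registry, so the model never has to reason about irrelevant options. All names below are illustrative assumptions, not Vercel's actual implementation.

```python
# Hypothetical tool registry: in a real harness each entry would carry a
# full JSON schema, not just a description string.
FULL_REGISTRY = {
    "read_file": "Read a file from the workspace",
    "write_file": "Write a file to the workspace",
    "run_tests": "Run the project's test suite",
    "search_web": "Search the web",
    "query_db": "Run a SQL query",
    "deploy": "Deploy to production",
}

def trim_toolbelt(registry: dict[str, str], allowlist: set[str]) -> dict[str, str]:
    """Return only the tools on the allowlist, preserving descriptions."""
    return {name: desc for name, desc in registry.items() if name in allowlist}

# A code-editing task rarely needs deployment or web search:
tools_for_edit_task = trim_toolbelt(
    FULL_REGISTRY, allowlist={"read_file", "write_file", "run_tests"}
)
```

The design choice is per-task allowlists rather than one global toolset: the smaller the menu handed to the model, the fewer tokens it burns deciding what not to use.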
Boris Cherny · Claude Code
2 to 3× output quality from verification
Effective verification methods (pre-stop hooks, golden tests) multiply final output quality.
2–3× quality multiplier
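A pre-stop hook of the kind described above can be sketched as a gate: the agent is only allowed to declare itself done once an independent check (for example, a golden-test suite) passes. The shape below is a hypothetical sketch, not Claude Code's actual hook API.

```python
from typing import Callable

def pre_stop_hook(run_verification: Callable[[], bool], max_attempts: int = 3) -> str:
    """Gate the agent's stop request on a verification callback.

    Returns "stop" once verification passes, or "escalate" if it never
    does within max_attempts. In a real harness, each failure's output
    would be fed back into the agent's context before the next attempt.
    """
    for _ in range(max_attempts):
        if run_verification():
            return "stop"
        # ...agent resumes work here, guided by the failure output...
    return "escalate"  # verification never passed; hand off to a human

# A toy verifier that fails twice, then passes:
results = iter([False, False, True])
outcome = pre_stop_hook(lambda: next(results))
```

The multiplier comes from the loop: a failed check turns into another work cycle instead of a wrong answer shipped to the user.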
OpenAI Codex · Feb 2026
~1M lines, ~1,500 PRs, 0 hand-written
Five months. Real product. Humans built the harness; agents wrote the code.
1,000,000 lines · 0 hand-written · the harness shipped it
The pattern. None of these teams beat their numbers by upgrading to the next model. They beat the previous version of themselves by re-engineering the system around the model: fewer tools, tighter verification, durable session state, hand-written context. The harness is the leverage.