The Rise of Harness Engineering

TL;DR

A raw language model is not an agent. It becomes one only when surrounding infrastructure gives it state, tools, memory, feedback, and constraints. That infrastructure is the harness.
Agent = Model + Harness. Prompt engineering formalized how we talk to the model; harness engineering formalizes how we make the model work.
The gap between a demo and a production agent is almost entirely a harness problem, and most production failures have harness-level solutions, not model-level ones.
The question for teams building agents is no longer which framework to use. It is what your harness looks like, and whether you engineer it with the rigor you bring to the model.

Here is a counterintuitive result from the Terminal-Bench 2.0 leaderboard: one team moved a coding agent from the top 30 to the top 5. Same model, same weights, zero retraining. They changed only the harness.

That result should reframe how every engineering team thinks about AI agents. We have spent years optimizing models. The real leverage now is in everything else.

◆ Background · Terminal-Bench 2.0

What the leaderboard actually measures

Terminal-Bench 2.0, built by the Laude Institute, scores AI agents on their ability to do real work inside a containerized terminal: compiling code, debugging async bugs, setting up servers, running data-science workflows, and resolving security vulnerabilities. It spans 89 human-validated tasks, each attempted five times, with execution handled by the Laude Institute's Harbor framework in Docker. Because every task is a multi-step, tool-driven workflow rather than a single answer, it measures operational reliability, exactly the surface a harness governs. That is why the same model can finish near the bottom or near the top depending on the scaffolding around it.

What is an agent harness?

A raw language model is not an agent. It becomes one when surrounding infrastructure gives it state, tool execution, memory, feedback loops, and enforceable constraints. That surrounding infrastructure is the harness: every piece of code, configuration, and execution logic that is not the model itself.

The equestrian analogy is apt. A horse is powerful, but without reins, a saddle, and a bridle it goes wherever it pleases. The model is the horse. The harness channels its power. The engineer is the rider who iterates on the harness to steer outcomes.

Concretely, a harness includes system prompts, tool schemas and invocation logic, bundled infrastructure (filesystem, sandbox, browser, bash), orchestration logic for sub-agents and handoffs, and middleware hooks for compaction, linting, and verification.
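To make the list above concrete, here is a minimal sketch of a harness loop in Python. Every name in it (Harness, ModelReply, scripted_model, word_count) is hypothetical and chosen for illustration; the "model" is a scripted stand-in rather than a real LLM call. The point is only to show which parts are harness: the system prompt, the tool registry, the step budget, and the transcript that carries state between turns.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ModelReply:
    """Hypothetical reply shape: either a tool request or a final answer."""
    tool: Optional[str] = None
    argument: str = ""
    final_answer: Optional[str] = None


@dataclass
class Harness:
    system_prompt: str
    tools: Dict[str, Callable[[str], str]]
    max_steps: int = 10
    transcript: List[str] = field(default_factory=list)

    def run(self, model: Callable[[List[str]], ModelReply], task: str) -> str:
        # State: the transcript is the only memory the model gets.
        self.transcript = [self.system_prompt, f"TASK: {task}"]
        for _ in range(self.max_steps):  # constraint: bounded number of turns
            reply = model(self.transcript)
            if reply.final_answer is not None:
                return reply.final_answer
            tool = self.tools.get(reply.tool or "")
            if tool is None:
                # Constraint: unknown tools are reported back, never executed.
                self.transcript.append(f"ERROR: unknown tool {reply.tool!r}")
                continue
            # Feedback: the tool result is appended for the next turn.
            self.transcript.append(
                f"{reply.tool}({reply.argument}) -> {tool(reply.argument)}"
            )
        return "stopped: step budget exhausted"


def scripted_model(transcript: List[str]) -> ModelReply:
    """Stand-in for a real model: calls one tool, then answers with its result."""
    if not any(line.startswith("word_count(") for line in transcript):
        return ModelReply(tool="word_count", argument="a b c")
    return ModelReply(final_answer=transcript[-1].split("-> ")[1])


harness = Harness(
    system_prompt="You are a careful assistant.",
    tools={"word_count": lambda text: str(len(text.split()))},
)
result = harness.run(scripted_model, "Count the words in 'a b c'.")
print(result)
```

Swapping scripted_model for a real model call would not change the loop, which is the sense in which everything in Harness is "not the model itself".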

The formula is simple:

Agent = Model + Harness.

If you have followed the evolution of prompt engineering, you already have the intuition for this. Plain English instructions to a model are just text. But add structure, roles, examples, and chain-of-thought scaffolding, and you get prompt engineering: a discipline for systematically improving model inputs. Harness engineering is the same conceptual leap applied to everything around the model. Tools, MCPs, skills, guardrails, code execution sandboxes, human-in-the-loop validation, memory systems: left as ad hoc additions they are just plumbing. Organized into a deliberate, engineered system, they become a harness.

Prompt engineering formalized how we talk to the model. Harness engineering formalizes how we make the model work.

◆ Background · the top-30-to-top-5 result

The numbers behind the opening claim

The result that opens this piece comes from LangChain. In Improving Deep Agents with Harness Engineering, the team kept the underlying model fixed and iteratively tuned just three things: system prompts, tools, and middleware hooks. Their deepagents-cli rose 13.7 points, from 52.8 to 66.5 on Terminal-Bench 2.0, moving from roughly the top 30 into the top 5. Two of the levers map directly onto the failure modes below: a PreCompletionChecklistMiddleware that intercepts the agent before it exits and forces a verification pass, and a LocalContextMiddleware that maps the working environment upfront so the agent does not waste turns rediscovering it.
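The pre-completion idea can be sketched in a few lines. This is not the deepagents API; the function names, the check names, and the EXIT_ALLOWED / EXIT_BLOCKED strings are all assumptions made for illustration. The sketch shows only the shape of the control: when the model says it is done, the harness runs a checklist and either lets it exit or sends it back with the items that still fail.

```python
from typing import Callable, Dict, List

Check = Callable[[], bool]


def failing_checks(checks: Dict[str, Check]) -> List[str]:
    """Return the names of checks that still fail; empty means the agent may exit."""
    return [name for name, check in checks.items() if not check()]


def handle_exit_attempt(checks: Dict[str, Check]) -> str:
    """Called by the (hypothetical) harness when the model declares completion."""
    failing = failing_checks(checks)
    if not failing:
        return "EXIT_ALLOWED"
    # Instead of exiting, hand the agent a concrete list of what to verify.
    return "EXIT_BLOCKED: verify before finishing: " + ", ".join(failing)


# Toy state standing in for real signals such as "the test runner was invoked".
state = {"tests_ran": False, "output_file_written": True}
checks = {
    "tests were run": lambda: state["tests_ran"],
    "output file exists": lambda: state["output_file_written"],
}

first = handle_exit_attempt(checks)   # blocked: tests have not run yet
state["tests_ran"] = True
second = handle_exit_attempt(checks)  # allowed: every check passes
print(first)
print(second)
```

The design choice worth noting is that the gate is code, not a prompt instruction, so the model cannot talk its way past it.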

Why models need harnesses

The gap between a demo and a production agent is almost entirely a harness problem. Models hallucinate, lose coherence over long contexts, attempt to exit tasks prematurely, and replicate bad patterns they find in a codebase. None of these are solvable with better prompting alone. They require structural, enforceable controls at the harness level.

Consider the most common failure modes in production:

Early Victory Declaration. After building a few features, the agent surveys its progress and declares the project complete, even though dozens of requirements remain unimplemented.

One-Shotting. The agent tries to implement everything at once, runs out of context mid-feature, and leaves the next session to guess what happened.

Context Rot. As the context window fills with accumulated tool output noise, reasoning quality degrades measurably.

Architecture Drift. The agent replicates whatever patterns it finds, including bad ones, because it has no signal about which patterns are intentional.

Each of these has a harness-level solution, not a model-level one.

The control theory framework

Birgitta Böckeler at Thoughtworks brought a clarifying lens to harness design by framing it through control theory. Every harness control is either a guide (feedforward) or a sensor (feedback).

Guides steer the agent before it acts. They increase the probability of good results on the first attempt. Examples include convention files like AGENTS.md, architecture decision records, skill documents, and bootstrap scripts. They are the equivalent of road signs and lane markings.

Sensors observe after the agent acts and help it self-correct. Linters with custom LLM-friendly error messages, type checkers, test suites, browser automation, and AI review agents all fall here. The best sensors embed self-correction instructions directly in their output, a productive form of prompt injection.

Use only guides and the agent encodes rules but never learns whether they worked. Use only sensors and it repeats the same mistakes. A robust harness needs both, layered so each compensates for the other's weaknesses.
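As an illustration of a sensor whose output carries its own correction hint, here is a small sketch built on Python's standard ast module. The rule it enforces (no bare except clauses) and the wording of the message are assumptions picked for the example, not something the framework above prescribes.

```python
import ast
from typing import List


def bare_except_sensor(source: str) -> List[str]:
    """Flag bare 'except:' clauses and tell the agent how to fix each one."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # An ExceptHandler with no exception type is a bare 'except:'.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(
                f"line {node.lineno}: bare 'except:' hides real failures. "
                "Fix: catch the specific exception you expect, "
                "for example 'except ValueError:', then re-run the checks."
            )
    return findings


# A patch the agent might have produced.
agent_patch = """
try:
    value = int(raw)
except:
    value = 0
"""

feedback = bare_except_sensor(agent_patch)
for message in feedback:
    print(message)
```

A guide for the same rule would be a line in a convention file telling the agent to catch specific exceptions; the sensor is what notices when that guidance did not hold.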

Böckeler's control-theory framing: guides increase the odds of a good first attempt, sensors catch what slips through.

◆ Background · the guides-and-sensors lens

Where the control-theory framing comes from

Birgitta Böckeler, a Distinguished Engineer at Thoughtworks, set out this vocabulary in Harness engineering for coding agent users on martinfowler.com. She organizes controls along two axes. The first separates guides (feedforward, steering before action) from sensors (feedback, correcting after action). The second separates computational controls, which are deterministic, fast, and cheap (a type checker, a structural test), from inferential ones, which are semantic, slower, and non-deterministic (an LLM-as-judge or AI reviewer). It gives the discipline a shared vocabulary for classifying controls.

These controls further split by execution type. Computational controls (linters, type checkers, tests) are deterministic, fast, and cheap enough to run on every change. Inferential controls (AI code review, "LLM as judge") are semantic, slower, and non-deterministic, but they catch issues that no static analysis can.
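One way to act on that split, sketched under assumptions: run the computational controls first and call the inferential ones only when the cheap checks pass. The ordering policy, the Control class, and the stand-in reviewer are all hypothetical; the source describes the two categories, not this particular scheduling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Control:
    name: str
    kind: str                        # "computational" or "inferential"
    run: Callable[[str], List[str]]  # returns problems; empty list means pass


def review_change(diff: str, controls: List[Control]) -> List[str]:
    """Run deterministic checks first; pay for inferential ones only if they pass."""
    # sorted() is stable, so controls keep their relative order within each kind.
    ordered = sorted(controls, key=lambda c: c.kind != "computational")
    problems: List[str] = []
    for control in ordered:
        if control.kind == "inferential" and problems:
            break  # skip the slow reviewer when the change already fails
        problems += [f"{control.name}: {p}" for p in control.run(diff)]
    return problems


calls: List[str] = []


def fake_llm_review(diff: str) -> List[str]:
    """Stand-in for an LLM-as-judge call; records that it was invoked."""
    calls.append(diff)
    return []


controls = [
    Control("ai-review", "inferential", fake_llm_review),
    Control(
        "no-todo",
        "computational",
        lambda d: ["TODO left in diff"] if "TODO" in d else [],
    ),
]

blocked = review_change("+ x = 1  # TODO", controls)  # reviewer never called
clean = review_change("+ x = 1", controls)            # reviewer called once
print(blocked, clean, len(calls))
```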

Solving the multi-session problem

The hardest unsolved problem in agent engineering is long-horizon work. Each new session begins with no memory of what came before. Imagine staffing a software project with engineers who work in shifts, where every new engineer arrives with complete amnesia about the previous shift.

Anthropic's engineering team found an elegant two-agent pattern to address this:

An Initializer Agent runs once at the start, producing a structured feature_list.json with every requirement marked as failing, a progress log for cross-session handoffs, a bootstrap script, and a clean git baseline.

A Coding Agent runs in every subsequent session. Its harness-enforced onboarding sequence is strict: read the progress file and git log, run the bootstrap script, smoke-test the environment, fix any regressions, then pick exactly one failing feature to work on. Commit before the session ends. Update the progress log. Leave a clean state.

The feature list is JSON, not Markdown, because agents tend to rewrite prose specs when they should not. The model is strongly instructed that it may only toggle the passes field and must never delete or edit features. This transforms a vague spec into a machine-readable contract.
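Because the contract is machine-readable, the harness can check it rather than trust the instruction. The sketch below assumes a schema of id, description, and passes for each feature; only the passes field is named in the text above, so the other field names and both function names are illustrative.

```python
import json
from typing import List, Optional


def validate_feature_update(before_json: str, after_json: str) -> List[str]:
    """Return contract violations; an empty list means the update is acceptable."""
    before = json.loads(before_json)
    after = json.loads(after_json)
    if len(before) != len(after):
        return ["features were added or deleted"]
    violations = []
    for old, new in zip(before, after):
        # Everything except 'passes' must be byte-for-byte unchanged.
        old_rest = {k: v for k, v in old.items() if k != "passes"}
        new_rest = {k: v for k, v in new.items() if k != "passes"}
        if old_rest != new_rest:
            violations.append(f"feature {old.get('id')!r} was edited")
        elif not isinstance(new.get("passes"), bool):
            violations.append(f"feature {old.get('id')!r} has a non-boolean 'passes'")
    return violations


def next_failing_feature(features_json: str) -> Optional[dict]:
    """Pick exactly one feature to work on: the first one still failing."""
    return next((f for f in json.loads(features_json) if not f["passes"]), None)


baseline = json.dumps([
    {"id": "login", "description": "User can log in", "passes": False},
    {"id": "logout", "description": "User can log out", "passes": False},
])
toggled = json.dumps([
    {"id": "login", "description": "User can log in", "passes": True},
    {"id": "logout", "description": "User can log out", "passes": False},
])
deleted = json.dumps([
    {"id": "login", "description": "User can log in", "passes": True},
])

print(validate_feature_update(baseline, toggled))
print(validate_feature_update(baseline, deleted))
print(next_failing_feature(toggled))
```

A harness could run a check like this before accepting the session's commit and reject any update that returns violations.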

Harness primitives that matter

Several primitives have emerged as essential building blocks.

Compaction solves context rot by intelligently summarizing the context window so the agent can continue working. Tool call offloading keeps the head and tail of large outputs and moves the full content to the filesystem.
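Tool call offloading is simple enough to sketch directly. The version below is a minimal illustration, assuming a line-based cutoff and a hypothetical file-naming scheme; real harnesses may cut by tokens or bytes instead. Compaction proper needs a model to write the summary, so it is not shown.

```python
import tempfile
from pathlib import Path


def offload_tool_output(output: str, directory: Path, keep_lines: int = 3) -> str:
    """Keep the head and tail of a large output in context; save the rest to disk."""
    lines = output.splitlines()
    if len(lines) <= 2 * keep_lines:
        return output  # small enough to keep verbatim
    # Name files by how many are already stored (illustrative scheme only).
    path = directory / f"tool_output_{len(list(directory.iterdir()))}.txt"
    path.write_text(output, encoding="utf-8")
    omitted = len(lines) - 2 * keep_lines
    return "\n".join(
        lines[:keep_lines]
        + [f"[... {omitted} lines omitted; full output saved to {path} ...]"]
        + lines[-keep_lines:]
    )


workdir = Path(tempfile.mkdtemp())
big_output = "\n".join(f"line {i}" for i in range(100))
in_context = offload_tool_output(big_output, workdir)
print(in_context)
```

The marker line tells the agent where the full output lives, so it can read the file later if the head and tail turn out not to be enough.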

The Ralph Loop intercepts the model's exit attempt via a hook and reinjects the original prompt in a clean context window, forcing the agent to continue against a completion goal. The filesystem makes this possible: each iteration reads state from the previous one.

◆ Background · the Ralph Loop

A bash one-liner that became a pattern

The Ralph Loop was created by Geoffrey Huntley in mid-2025 and later popularized inside Claude Code. The idea is deliberately crude: keep feeding the same prompt to the agent on a loop, so the prompt stays fixed while everything around it changes: the codebase, the test results, the git history, and a progress file. A Stop hook intercepts the model's attempt to exit and reinjects the task, and each fresh iteration reads the previous one's state from disk. The agent gradually self-corrects until it meets the completion criteria or hits an iteration cap. It is a vivid illustration of the article's thesis: leverage living entirely outside the weights.
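The control flow can be sketched as a plain Python loop. This is not how Claude Code implements its Stop hook; the function names, the progress.json layout, and the stub agent that finishes one item per run are assumptions that stand in for a real agent session. What the sketch preserves is the pattern: fixed prompt, fresh context each iteration, state carried only on disk, and an iteration cap.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def ralph_loop(
    prompt: str,
    run_agent: Callable[[str, Path], None],
    is_complete: Callable[[Path], bool],
    state_file: Path,
    max_iterations: int = 10,
) -> Tuple[int, bool]:
    """Re-run the same prompt until the goal is met or the cap is hit.

    Returns (iterations used, whether the completion check passed).
    """
    for iteration in range(1, max_iterations + 1):
        run_agent(prompt, state_file)  # fresh context; only the disk carries memory
        if is_complete(state_file):
            return iteration, True
    return max_iterations, False


def stub_agent(prompt: str, state_file: Path) -> None:
    """Stand-in for one agent session: finishes a single pending item per run."""
    state = json.loads(state_file.read_text())
    if state["pending"]:
        state["done"].append(state["pending"].pop(0))
    state_file.write_text(json.dumps(state))


def all_done(state_file: Path) -> bool:
    return not json.loads(state_file.read_text())["pending"]


state_file = Path(tempfile.mkdtemp()) / "progress.json"
state_file.write_text(json.dumps({"pending": ["a", "b", "c"], "done": []}))

iterations, completed = ralph_loop(
    "Finish every pending item.", stub_agent, all_done, state_file
)
print(iterations, completed)
```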

Skills as progressive disclosure solve the startup problem. Loading too many tools into context on initialization degrades reasoning before the agent begins any work. Skills load only front-matter initially and reveal full capability on demand.
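A minimal sketch of that progressive disclosure, assuming each skill is a text file whose front-matter sits between two lines of three dashes and holds a name and a description. The file layout, the SkillRegistry class, and the pdf-report skill are illustrative assumptions, and the parser handles only simple "key: value" lines rather than full YAML.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Skill:
    name: str
    description: str
    body: str  # full instructions, kept out of context until requested


def parse_skill(text: str) -> Skill:
    """Split a skill file into front-matter fields and body."""
    _, header, body = text.split("---\n", 2)
    fields = dict(line.split(": ", 1) for line in header.strip().splitlines())
    return Skill(fields["name"], fields["description"], body.strip())


class SkillRegistry:
    def __init__(self, skills: List[Skill]):
        self._skills = {skill.name: skill for skill in skills}

    def startup_context(self) -> str:
        """What the model sees at initialization: one line per skill."""
        return "\n".join(
            f"{skill.name}: {skill.description}" for skill in self._skills.values()
        )

    def load(self, name: str) -> str:
        """Called when the model asks for a skill; returns the full body."""
        return self._skills[name].body


SKILL_FILE = """---
name: pdf-report
description: Build a PDF report from a CSV file.
---
1. Read the CSV.
2. Render the chart.
3. Export to PDF.
"""

registry = SkillRegistry([parse_skill(SKILL_FILE)])
print(registry.startup_context())
print(registry.load("pdf-report"))
```

With many skills registered, the startup context grows by one short line per skill instead of by each skill's full instructions.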

Verification loops close the quality gap. Agents are prompted to test as a human user would, through browser automation, curl commands, and test runners. Features may only be marked as passing after verified end-to-end testing, not just code review.

The training and harness feedback loop

Today's agent models are post-trained with their harnesses in the loop, a tight coupling with notable implications. Models trained with a specific harness can become overfitted to that harness's tool signatures. OpenAI found that changing the apply_patch tool logic in Codex led to noticeably worse model performance, even though a general model should handle different patch methods interchangeably.

This creates a cycle: useful primitives are discovered in production, added to the harness, the model is trained with the harness in the loop, the model improves within that harness, and the cycle repeats. But the takeaway is counterintuitive: do not assume the vendor's default harness is optimal for your task. The Terminal-Bench result, a 13.7-point gain with the model held fixed, suggests there is substantial performance to unlock by optimizing the harness for your specific use case.

What harness engineering is not

It is worth drawing clear boundaries. Prompt engineering is one component of a harness, not a synonym. Context engineering is the primary delivery mechanism inside a harness, but harness engineering is the broader discipline. MLOps concerns model training and deployment pipelines, while harness engineering is about orchestrating agent behavior in real-time execution. Agent frameworks provide building blocks; the harness is the actual runtime system governing behavior in production.

If prompting is "turn right," harness engineering is the road, guardrails, signs, and traffic system that lets ten vehicles navigate safely at once.

What comes next?

Four trends will define the next phase.

Context durability becomes the bottleneck. The gap between models shows not on leaderboards but across hundreds of tool calls over many context windows. Harnesses will detect exactly when a model drifts and feed that data back into training.

Stay lightweight to survive model updates. Every model release changes the optimal agent structure. Build harnesses that let you remove logic that models have absorbed natively. Over-engineer the control flow and the next model update breaks your system.

Parallel orchestration at scale. Orchestrating hundreds of agents working simultaneously on a shared codebase, maintaining coherence, avoiding conflicts, and managing a shared ledger of work, remains an open research problem.

Self-analyzing harnesses. Agents will analyze their own execution traces to identify and fix harness-level failures automatically, closing the steering loop without manual intervention.

The question for every team building AI agents today is no longer which framework to use. It is what your harness looks like, and whether you are engineering it with the same rigor you bring to the model itself.

Sources:

LangChain · The Anatomy of an Agent Harness (March 2026). Harness primitives, the Ralph Loop pattern, and the deepagents research library.
LangChain · Improving Deep Agents with Harness Engineering (February 2026). Terminal-Bench 2.0 results: top 30 to top 5 by changing only the harness.
Anthropic Engineering · Effective Harnesses for Long-Running Agents (March 2026). The two-agent pattern, structured feature lists, and compaction strategies for long-horizon work.
Birgitta Böckeler, Thoughtworks · Harness Engineering for Coding Agent Users (April 2026). Control theory framework (guides vs. sensors), harnessability, and the three regulation categories. Published on martinfowler.com.
OpenAI · Harness Engineering: Leveraging Codex in an Agent-First World (February 2026). AGENTS.md as table of contents, golden principles, reproducible environments, and 1M+ lines of agent-written production code.
Philipp Schmid · The Importance of Agent Harness in 2026 (January 2026). Context durability, the Bitter Lesson applied to harnesses, and future directions for model-harness convergence.