The Agentic Engineer [Part 2/3]: Shipping a Feature Without Losing the Thread

TL;DR

The Agentic Engineer · Part 2 of 3. A series on running an AI coding agent as a production engineering discipline. 1. Principles & the daily operating model · 2. The full pipeline · 3. The brownfield.
This is the artifact-driven pipeline in full: nine steps that collapse a feature's decisions into a reviewed spec before code exists, so each decision is made once and never silently re-litigated.
Do not run this on small work. The gates are leverage in proportion to how many decisions a task contains. On a feature they are the difference between shipping and rework; on a typo they are theater.
One idea, applied repeatedly: push every decision and every defect as far left as it will go, and never advance a stage on an assertion when you could advance it on evidence.

In Part 1 I argued that structure beats prompting on any task with enough decisions to get wrong, and I gave the math: if the agent is right 80% of the time per decision and a feature has 20 of them, unguided you land all of them about 1% of the time. The fix isn't a smarter prompt. It's collapsing those decisions into a reviewed artifact before code exists, so each one is made once, deliberately, and never silently re-litigated mid-build.
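The arithmetic behind that 1% figure is plain compounding. A minimal sketch, assuming every decision succeeds or fails independently of the others (the function name is illustrative):

```python
def chance_all_correct(per_decision_accuracy: float, decisions: int) -> float:
    """Probability that every decision lands, assuming independence."""
    if not 0.0 <= per_decision_accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if decisions < 0:
        raise ValueError("decisions must be non-negative")
    return per_decision_accuracy ** decisions


# 20 decisions at 80% each: 0.8 ** 20 is about 0.0115, a little over 1%.
print(f"{chance_all_correct(0.8, 20):.4f}")
```

Real decisions are rarely independent, so treat the number as an order-of-magnitude argument rather than a measurement.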

This article is that artifact-driven pipeline in full. A warning up front, because it's the whole point: do not run this on small work. A rename has nothing for a spec to de-risk, so the pipeline there is pure tax. I run it only on Large tasks: multi-file, uncertain approach, real blast radius, the kind where I'd later want a spec to point back to. Forcing small work through the heavy lane is exactly how the heavy lane loses credibility and stops getting used when it matters.

Here are the nine steps, and why each one exists.

Step 1: Branch, and isolate parallel work

Nothing of substance touches main. New task, new branch. When I expect to run independent sub-tasks at the same time, I use git worktrees so isolated sessions never collide on the same working tree. This is plumbing, but it's the plumbing that makes everything after Step 6 safe to parallelize.

◆ Background · git worktrees

Why worktrees, not branches alone

A single Git checkout has one working directory and one index, so two agents editing the same tree fight over .git/index.lock and over each other's uncommitted files. git worktree checks out additional branches into separate directories that share one object store: each gets a private HEAD, index, and working tree, while history stays common. That isolation is why worktrees have become the default primitive for running multiple coding agents at once, and why Claude Code's docs recommend them for multi-session work. The caveat, which the author's Step 6 handles: worktrees isolate the tree, not shared databases, ports, or caches, so a contract still has to coordinate anything below the filesystem.
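A minimal sketch of setting up one worktree per sub-task, assuming a sibling directory layout (`<repo>-wt/<branch>`) and a base branch called main. Both the layout and the helper names are illustrative, not something the article prescribes:

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, branch: str, base: str = "main") -> list[str]:
    """Build `git worktree add -b <branch> <path> <base>` for a sibling directory."""
    # Hypothetical layout: <repo>-wt/<branch with slashes flattened>
    target = repo.parent / f"{repo.name}-wt" / branch.replace("/", "-")
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(target), base]


def create_worktrees(repo: Path, branches: list[str], base: str = "main") -> None:
    """Create one isolated worktree per branch; raises if git reports an error."""
    for branch in branches:
        subprocess.run(worktree_add_command(repo, branch, base), check=True)
```

Each session then runs inside its own directory, with its own index and HEAD, so parallel edits cannot contend for the same lock file.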

Step 2: Plan mode: understand before deciding

I open plan mode and let the agent read the relevant files and answer questions without making any changes. The discipline here is to point it at specific files and existing patterns rather than describing them from memory. Description is lossy and often wrong; the code is ground truth. The output of this step isn't a plan yet, it's shared understanding. Deciding before understanding is how you get a confident implementation of the wrong thing.

Step 3: The implementation plan, edited by hand

Now the agent produces a detailed plan: which files change, the data flow, the sequence of work. I open it in my editor and edit it directly before anything proceeds, and I guard the turn with an explicit phrase: do not implement yet. That guard matters more than it looks. The default failure mode is the agent reading "plan" as "permission to start," and once it starts, the plan stops being a decision document and becomes a post-hoc rationalization of code already written.

Step 4: The spec, made self-contained

I have the agent write a self-contained SPEC.md. The best specs share three traits: they name the actual files and interfaces involved, they state explicitly what is out of scope, and they end with an end-to-end verification step that proves the feature works. Out-of-scope is the underrated one; most scope creep is just a spec that never drew its own boundary.
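Those three traits are mechanical enough to check before a human reads the spec. A rough sketch, assuming SPEC.md uses Markdown headings with the names below (the heading names and the function are hypothetical, so adapt them to your own template):

```python
import re

# Hypothetical heading names; adapt to whatever your SPEC.md template uses.
REQUIRED_SECTIONS = {
    "files_and_interfaces": r"^#+\s*files\s*(and|&)\s*interfaces\b",
    "out_of_scope": r"^#+\s*out\s*of\s*scope\b",
    "verification": r"^#+\s*(end-to-end\s*)?verification\b",
}


def missing_spec_sections(spec_text: str) -> list[str]:
    """Return the names of required sections that the spec does not contain."""
    missing = []
    for name, pattern in REQUIRED_SECTIONS.items():
        if not re.search(pattern, spec_text, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(name)
    return missing


draft = """# Feature: export invoices
## Files and interfaces
- billing/export.py
## Verification
Run the export against the fixture account.
"""
print(missing_spec_sections(draft))  # ['out_of_scope']
```

A check like this only proves the sections exist. Whether the out-of-scope list draws the right boundary is still a judgment call for the reviewer.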

For larger features I don't write the spec straight through. I start minimal and have the agent interview me, using the AskUserQuestion tool, with a standing instruction to dig into the hard parts I might not have considered. Then it writes the spec from the interview. This inverts the usual dynamic: instead of me trying to anticipate every gap, the agent surfaces the gaps and I adjudicate them. Time spent making the spec precise pays back more than any time spent watching the implementation, because the spec is where the 20 decisions actually get made.

◆ Background · spec-driven development

The spec as source of truth

Spec-driven development (SDD) writes a structured specification of what the software should do before any code is generated, and treats that spec as the authoritative source of truth for both humans and the agent. The premise: models are strong at pattern completion but poor at mind-reading, so a vague prompt yields plausible code built on dozens of unstated, often wrong assumptions (Osmani, "How to write a good spec for AI agents"). The pattern has hardened into tooling: GitHub's Spec Kit formalizes a Specify → Plan → Tasks → Implement flow, and AWS Kiro builds an IDE around it. The author's twist is the AskUserQuestion interview, which makes the agent hunt for the gaps instead of waiting for the human to anticipate them.

Step 5: Review the spec before any code exists

Before a single line is written, the enterprise-standards reviewer (a fresh-context subagent, see Part 1) reviews and revises the spec for cloud compatibility, modularity, security posture, and our coding guidelines. The economics here are the entire argument: catching a contradiction or a security risk in a spec is nearly free. Catching the same defect after it's been implemented across six files, with tests written against the wrong assumption, is not. Every defect you can push left of the first commit is a defect that costs you a sentence instead of a day.

Step 6: Lock the data contracts

This is the load-bearing step, and on a Large task I never skip it. Before splitting work, I lock the data contracts: the shapes and interfaces that cross module and service boundaries. Contracts are what let me hand different pieces to different agents (or different sessions) without them drifting apart. Two sub-tasks that agree on a contract can be built in complete isolation and still fit together on the first try. Two sub-tasks that share live state instead of a contract will diverge, and you'll discover it at integration, which is the most expensive possible place to discover it.

Contracts before parallelism, always. If the contracts aren't lockable yet, the work isn't ready to split.
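One way to make a contract lockable is to express it as a type that both sides import and validate against. A minimal sketch with a hypothetical OrderSummary payload (the name, fields, and validation rules are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class OrderSummary:
    """Hypothetical contract crossing a service boundary."""
    order_id: str
    total_cents: int
    currency: str

    @classmethod
    def from_payload(cls, payload: dict[str, Any]) -> "OrderSummary":
        expected = {"order_id": str, "total_cents": int, "currency": str}
        extra = set(payload) - set(expected)
        missing = set(expected) - set(payload)
        if extra or missing:
            raise ValueError(
                f"contract drift: missing={sorted(missing)} extra={sorted(extra)}"
            )
        for name, kind in expected.items():
            value = payload[name]
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, kind):
                raise TypeError(f"{name} must be {kind.__name__}")
        return cls(**payload)
```

The producer's tests assert that its output passes from_payload, and the consumer's tests build fixtures through the same class, so any drift shows up in a unit test rather than at integration.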

Step 7: Decompose, then parallelize only what is genuinely independent

I break the spec into sub-tasks scoped to those contracts, then sort them by coupling.

Genuinely independent sub-tasks (they share only the contract, not live state) fan out, either to parallel sessions or, for large mechanical work, to non-interactive runs: claude -p in a loop, scoped tightly with --allowedTools, and tested on two or three items before I run it at scale. Never trust a batch job you haven't watched complete a few iterations.

Coupled sub-tasks stay in the main session and get implemented sequentially, where I can watch and redirect. This is the part I had to learn the hard way, and I covered it in Part 1: I do not push coupled implementation into subagents just to keep my context clean. A subagent works blind to the main conversation, so it hands back a lossy summary of code I never saw, and that's precisely when I lose the thread and the ability to course-correct. The pollution I'd avoid is cheaper than the blindness I'd buy.
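The sorting rule and the pilot-before-scale habit can both be sketched in a few lines. This assumes each sub-task declares the live state it touches beyond the contract. The labels, class, and function names are hypothetical, and `run_one` stands in for whatever launches a single non-interactive run:

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class SubTask:
    name: str
    # Live state touched beyond the locked contract (hypothetical labels).
    live_state: frozenset[str] = frozenset()


def split_by_coupling(tasks: Iterable[SubTask]) -> tuple[list[SubTask], list[SubTask]]:
    """Return (independent, coupled); coupled means sharing live state with another task."""
    tasks = list(tasks)
    touch_count: dict[str, int] = {}
    for task in tasks:
        for label in task.live_state:
            touch_count[label] = touch_count.get(label, 0) + 1
    independent, coupled = [], []
    for task in tasks:
        if any(touch_count[label] > 1 for label in task.live_state):
            coupled.append(task)
        else:
            independent.append(task)
    return independent, coupled


def run_with_pilot(
    items: Iterable,
    run_one: Callable,
    pilot_size: int = 3,
    approve: Callable[[list], bool] = lambda results: False,
) -> list:
    """Run the first few items, then continue only if `approve` accepts the pilot."""
    items = list(items)
    results = [run_one(item) for item in items[:pilot_size]]
    if len(items) > pilot_size and approve(results):
        results += [run_one(item) for item in items[pilot_size:]]
    return results
```

The default `approve` refuses to continue, so scaling past the pilot is always an explicit decision rather than something the loop does on its own.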

Step 8: The build/verify loop, gated and arbitrated

For each unit of work, I loop on three things.

A. Test-driven development. Write the failing test first, then the code to pass it. The test isn't a chore appended at the end; it is the verification criterion that closes the loop. With it, the agent can work the loop itself: write, run, read the result, iterate until green. Without it, "looks done" is the only signal and I become the verifier.

B. Review, scaled and arbitrated. Each change passes the relevant gates. Agent-based gates: a fresh-context reviewer, plus the security reviewer whenever auth, data, or infrastructure is touched. Tool-based gates: the static-analysis quality gate, linters, types. The discipline that keeps this from collapsing into noise is twofold. First, every reviewer reports on a fixed severity scale (blocker / major / minor / nit) and is instructed to flag only correctness and requirement gaps, never style preferences. A reviewer told to "find what's wrong" will always find something; a reviewer told to flag blockers reports blockers. Second, one agent arbitrates. The enterprise-standards agent dedupes findings across reviewers, discards the noise, and emits a single prioritized must-fix list. Only blockers and majors gate the PR. And I keep a rough eye on each gate's precision over time: a reviewer whose "must-fix" findings are mostly noise is costing more attention than it saves, and it gets retired or retuned. More reviewers is not more safety. Past a point it's just triage I pay for.

◆ Background · the noisy reviewer

Why the severity scale and the arbiter matter

The author's "a reviewer told to find what's wrong will always find something" is not just intuition. Recent work finds LLM code reviewers exhibit systematic overcorrection: they frequently misclassify correct, requirement-compliant code as defective or non-compliant ("Are LLMs Reliable Code Reviewers?", arXiv 2603.00539). Survey work on LLM-as-a-judge reaches the same operational conclusion: reliability has to be engineered through constrained rubrics and bias mitigation, not assumed. A fixed blocker/major/minor scale plus a single arbiter that dedupes and discards is exactly the kind of constraint that turns an unreliable judge into a usable gate, which is why stacking more reviewers without arbitration buys triage, not safety.
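As a sketch of what such an arbiter might do mechanically: the field names, the crude dedupe key (file plus line), and the function names are illustrative assumptions, not the author's implementation.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITY_ORDER = ["blocker", "major", "minor", "nit"]
GATING = {"blocker", "major"}


@dataclass(frozen=True)
class Finding:
    reviewer: str
    file: str
    line: int
    severity: str
    message: str


def arbitrate(findings: list[Finding]) -> list[Finding]:
    """Dedupe by location, keep the most severe report, return only gating findings."""
    best: dict[tuple[str, int], Finding] = {}
    for finding in findings:
        if finding.severity not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity: {finding.severity}")
        key = (finding.file, finding.line)
        current = best.get(key)
        if current is None or (
            SEVERITY_ORDER.index(finding.severity)
            < SEVERITY_ORDER.index(current.severity)
        ):
            best[key] = finding
    must_fix = [f for f in best.values() if f.severity in GATING]
    return sorted(
        must_fix, key=lambda f: (SEVERITY_ORDER.index(f.severity), f.file, f.line)
    )


def gate_precision(flagged: int, confirmed: int) -> Optional[float]:
    """Share of a gate's must-fix findings that turned out to be real."""
    return None if flagged == 0 else confirmed / flagged
```

Rejecting unknown severities keeps reviewers on the fixed scale, and the precision figure gives a rough basis for deciding when a gate should be retuned or retired.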

C. End-to-end verification. If it's web-based, a real browser drives the e2e tests and compares screenshots against the design. If it's not, programmatic checks run against the contracts and fixtures. The non-negotiable across both: nothing is done until the actual verification command has run and its output is shown as evidence. Never asserted, always demonstrated.
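"Demonstrated, not asserted" can be made concrete by having the run capture the command, its exit code, and its output in one record. A minimal sketch. The Evidence type and function name are hypothetical, and the command shown is a stand-in so the example runs anywhere:

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    command: list[str]
    returncode: int
    output: str

    @property
    def passed(self) -> bool:
        return self.returncode == 0


def run_verification(command: list[str], timeout_s: float = 600) -> Evidence:
    """Run the real verification command and keep its combined output as evidence."""
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so failures stay in context
        text=True,
        timeout=timeout_s,
    )
    return Evidence(command=command, returncode=proc.returncode, output=proc.stdout)


# Stand-in command; swap in your actual test or e2e command.
evidence = run_verification([sys.executable, "-c", "print('3 checks passed')"])
print(evidence.passed, evidence.output.strip())
```

The point of the record is that "done" is derived from the exit code and the captured output, never from the agent's own summary of what happened.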

Anything that turns out unclear mid-build gets resolved by asking me through AskUserQuestion, rather than the agent guessing and me discovering the wrong assumption three steps downstream.

Step 9: The PR, with a real description

The agent opens a PR with a description a human can actually review against: what changed, the features delivered, coverage reports, the arbitrated must-fix list and how each item was resolved, and anything a reviewer specifically needs to know. main only ever changes through a reviewed, merged PR. The PR description is not paperwork; it's the evidence trail that lets the human reviewer spend their attention on judgment instead of archaeology.
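A description with those parts is easy to assemble from data the pipeline already holds. A rough sketch, with hypothetical section titles and function name, that refuses to render while any must-fix item lacks a resolution:

```python
def render_pr_description(
    summary: str,
    features: list[str],
    coverage: str,
    must_fix: dict[str, str],
    reviewer_notes: list[str],
) -> str:
    """Render a plain-text PR description; `must_fix` maps finding to resolution."""
    unresolved = [item for item, resolution in must_fix.items() if not resolution.strip()]
    if unresolved:
        raise ValueError(f"must-fix items without a resolution: {unresolved}")
    lines = ["What changed", summary, "", "Features delivered"]
    lines += [f"- {feature}" for feature in features]
    lines += ["", "Coverage", coverage, "", "Must-fix findings and resolutions"]
    lines += [f"- {item}: {resolution}" for item, resolution in must_fix.items()]
    lines += ["", "Reviewer notes"]
    lines += [f"- {note}" for note in reviewer_notes]
    return "\n".join(lines)
```

Failing on an unresolved item keeps the evidence trail honest: a PR cannot be opened with a finding that was silently dropped.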

When I step away: bounded autonomy, not trust

The pipeline above assumes I'm at the keyboard. But the heaviest mechanical work (large migrations, test backfilling, lint and type cleanup, dependency bumps, doc generation) is exactly the work I'd rather not babysit. So I let the agent run while I'm away, and I make it safe with bounded autonomy rather than trust.

Auto mode plus a sandbox. Running in auto permission mode lets routine work proceed while a classifier blocks scope escalation, unknown infrastructure, and hostile-content-driven actions; in non-interactive runs it aborts rather than stalls when it keeps getting blocked. I pair that with OS-level filesystem and network isolation, so the run is constrained by the environment, not merely by the model's judgment. Two independent layers, because a single layer of "the model decided not to" is not a control.

A goal condition with the stop-gate as backstop. I encode the success check as a goal condition, and a separate evaluator keeps the run working until that condition actually holds, not until the agent feels finished. Underneath it, a Stop hook running lint, typecheck, and tests is the deterministic floor that won't let a turn end red.

An adversarial reviewer before I count anything as done. A fresh-context reviewer runs over the full diff and reports gaps against the spec, so I return to a reviewed result, not a raw one.

Evidence over assertion. Every autonomous run leaves behind the test output, the commands it ran, and screenshots. Reviewing on return is then faster than re-verifying the work myself, which is the only way unattended runs actually save time instead of just relocating it.
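The deterministic floor described above can be a small script that the Stop hook invokes. A minimal sketch, assuming the hook runner treats exit code 2 as "block and report". That convention, the check names, and the function name are assumptions to verify against your tool's hook documentation:

```python
import subprocess
import sys

# Assumption: the hook runner treats this exit code as "do not let the turn end".
BLOCK_EXIT_CODE = 2


def run_stop_gate(checks: dict[str, list[str]]) -> int:
    """Run every check; return 0 if all pass, otherwise BLOCK_EXIT_CODE."""
    failed = []
    for name, command in checks.items():
        proc = subprocess.run(command, capture_output=True, text=True)
        if proc.returncode != 0:
            failed.append(name)
            # Send the failure output to stderr so the agent can read and act on it.
            print(f"[{name}] failed:\n{proc.stdout}{proc.stderr}", file=sys.stderr)
    return BLOCK_EXIT_CODE if failed else 0
```

A hook script would map names such as lint, typecheck, and tests to the project's real commands and finish with `sys.exit(run_stop_gate(checks))`. Every check runs even after an earlier one fails, so the agent sees all the red at once instead of one failure per turn.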

The shape of the whole thing

Strip it down and the pipeline is one idea applied repeatedly: push every decision and every defect as far left as it will go, and never advance a stage on an assertion when you could advance it on evidence. Understand before you plan. Plan before you spec. Review the spec before you write code. Lock contracts before you parallelize. Write the test before the implementation. Show the verification output before you call it done. Each gate is cheap relative to the cost of discovering the same problem one stage later.

That's why this lane is reserved for Large work. The gates are leverage exactly in proportion to how many decisions a task contains. On a feature, they're the difference between shipping and rework. On a typo, they're theater. Knowing which is which is the actual skill.

In the series: Part 1: Claude Code Is Not a Chatbot. The principles and the daily operating model the whole practice rests on. Part 3: Into the Brownfield. Everything here assumes greenfield, where you set the conventions. Next: what changes when you inherit them instead.