The Agentic Engineer [Part 1/3]: Claude Code Is Not a Chatbot. It's an Engineer You Build Structure Around.
Why the structure is the job, the prompting is secondary, and how I run an AI coding agent like a production engineering practice.
The Agentic Engineer · Part 1 of 3. A series on running an AI coding agent as a production engineering discipline. 1. Principles & the daily operating model · 2. The full pipeline · 3. The brownfield
- Rigor doesn't come from talent, it comes from structure. Treat an AI coding agent as an engineer you build structure around, not a snippet vending machine.
- Get the structure right and the prompting almost stops mattering. Get it wrong and no clever wording saves you.
- Match the weight to the task: a heavy pipeline for real features, a lightweight lane for small changes, so the heavy lane stays credible.
- Context is the only scarce resource. Almost every discipline below exists to keep the working context small and relevant.
I've spent a long career building software at scale, and the lesson that transfers most cleanly to AI coding agents is the oldest one in the book: rigor doesn't come from talent, it comes from structure. A great engineer with no process ships chaos. A disciplined system makes ordinary work reliable.
That's exactly how I treat Claude Code. It is not a snippet vending machine you ask clever questions. It is an agentic engineer, and my job is to build the structure it operates inside. Get the structure right and the prompting almost stops mattering. Get it wrong and no amount of clever wording saves you.
Here is the operating model I actually use, stripped to what earns its keep. One scope note: this is how I start and run new projects. Working in a large existing codebase changes the early moves, and that gets its own companion piece.
The agent, and the layers you build around it
Claude Code is a command-line coding agent that pairs a reasoning model with built-in tools for reading, editing, searching, and running code. The "structure" this article is about is its extension layer, documented by Anthropic: CLAUDE.md for always-on project context, skills for knowledge and workflows loaded on demand, subagents that run in isolated context and return only a summary, hooks that fire deterministically on lifecycle events, and permissions that decide what the agent is allowed to do at all. The piece below is one practitioner's discipline for wiring those layers together.
The six principles everything hangs off
Context is the only scarce resource. The context window holds the entire session: every message, file read, and command output. Quality degrades as it fills, and one debugging session can burn tens of thousands of tokens. Almost every choice below exists to keep the working context small and relevant. Context rot is the silent killer of output quality.
Why a full window degrades quality
"Context rot" is the measured tendency of language models to get less reliable as their input grows, well before the advertised window is full. Chroma's 2025 study tested 18 frontier models and found every one degraded as input length increased, with the effect worsening on more complex tasks. Earlier work by Liu et al. (2023), "Lost in the Middle," showed accuracy is highest when relevant facts sit at the start or end of the context and drops sharply when they are buried in the middle. That is the empirical backing for treating context as scarce and curating it aggressively.
Structure beats prompting, on tasks with enough decisions to get wrong. Assume the agent makes the right call ~80% of the time on a single decision. A real feature has ~20 decision points, so unguided, and treating those decisions as independent, the odds of getting every one right are roughly 0.8²⁰, about 1%. A reviewed spec collapses those decisions into ones already made. An hour of planning routinely saves a day of rework. The same math runs in reverse: a rename has nothing to de-risk, so structure there is pure tax.
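The arithmetic is easy to check. This is a minimal sketch that assumes every decision is independent and equally likely to be right, which real features only approximate. The function name and the "spec settles 15 of 20 decisions" figure are illustrative, not measurements.

```python
def chance_all_correct(p_per_decision: float, open_decisions: int) -> float:
    """Probability that every open decision lands right, assuming independence."""
    return p_per_decision ** open_decisions


# All 20 decisions left to the agent, each right 80% of the time.
unguided = chance_all_correct(0.8, 20)

# Hypothetical: a reviewed spec has already settled 15 of the 20.
with_spec = chance_all_correct(0.8, 5)

print(f"unguided:  {unguided:.1%}")
print(f"with spec: {with_spec:.1%}")
```

Under these assumptions the unguided run lands near 1% and the specced run near a third. The exact numbers matter less than how fast the product collapses as open decisions pile up.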
Match the weight to the task. This is the principle that keeps the rest from becoming ceremony. The full pipeline earns its cost on a feature and crushes a typo. I run an explicit lightweight lane for small changes precisely so the heavy lane stays credible.
Always give the agent a way to verify its own work. Tests, a build exit code, a linter, a screenshot diff. Without a check, "looks done" is the only signal and I become the verification loop. With a check, the loop closes itself. The discipline: nothing is done until the actual verification command has run and its output is shown as evidence, never asserted.
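One way to make "shown as evidence, never asserted" concrete is to capture the check's exit code and output together. The sketch below is illustrative only: `Evidence` and `run_check` are hypothetical names, and the check command is a stand-in so the example runs anywhere. In a real project it would be the actual test, build, or lint command.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Evidence:
    command: list[str]
    exit_code: int
    output: str

    @property
    def passed(self) -> bool:
        # A zero exit code is the only thing that counts as "done".
        return self.exit_code == 0


def run_check(command: list[str]) -> Evidence:
    """Run the verification command and keep what it actually printed."""
    result = subprocess.run(command, capture_output=True, text=True)
    return Evidence(command, result.returncode, result.stdout + result.stderr)


# Stand-in check; replace with the project's real verification command.
check = [sys.executable, "-c", "print('2 passed')"]
evidence = run_check(check)
print(evidence.exit_code, evidence.output.strip())
```

The design point is that the report carries the raw output, so a reader can confirm the claim without rerunning anything.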
Determinism where it matters, advice where it doesn't. Three enforcement layers, and the skill is picking the right one. Config files advise (followed maybe 80% of the time). Permissions decide what's allowed at all and can't be bypassed. Hooks fire deterministically at lifecycle events. Anything that must always happen is a hook or a permission, never a line of prose.
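To illustrate the difference between advice and enforcement, here is a sketch of the decision logic a pre-tool hook script might run. It rests on assumptions to verify against the current Claude Code hook documentation: that the event arrives as JSON with `tool_name` and `tool_input.command` fields, and that exit code 2 signals "block". The blocked-command list is a hypothetical policy.

```python
import json

# Hypothetical policy: command fragments this project never wants executed.
BLOCKED_FRAGMENTS = ("git push --force", "rm -rf /")

# Assumed exit-code convention (0 = allow, 2 = block); confirm in the docs.
ALLOW, BLOCK = 0, 2


def decide(raw_event: str) -> tuple[int, str]:
    """Return (exit_code, message) for one hook event delivered as JSON text."""
    event = json.loads(raw_event)
    # Assumed payload shape: {"tool_name": ..., "tool_input": {"command": ...}}
    if event.get("tool_name") != "Bash":
        return ALLOW, ""
    command = event.get("tool_input", {}).get("command", "")
    for fragment in BLOCKED_FRAGMENTS:
        if fragment in command:
            return BLOCK, f"blocked by policy: {fragment!r}"
    return ALLOW, ""


sample = json.dumps(
    {"tool_name": "Bash", "tool_input": {"command": "git push --force origin main"}}
)
print(decide(sample))
```

A real hook script would read the event from standard input, write the message to standard error, and exit with the returned code. The rule then holds whether or not the model remembers it, which is the property a line of prose in a config file cannot give you.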
Separate the writer from the reviewer, but give reviewers a budget. The model that wrote the code is biased toward it, so a fresh context grades it on its own terms. But a reviewer asked to "find what's wrong" always finds something. So reviewers report on a fixed severity scale and flag only what affects correctness or requirements. One agent arbitrates what actually blocks. More reviewers is not more safety; past a point it's just noise to triage.
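The reviewer budget can be pictured as a filter plus a single arbiter. The sketch below is a hypothetical data model, not the author's actual tooling: the four-level scale, the field names, and the default threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical fixed scale that every reviewer must report on.
    NIT = 1
    MINOR = 2
    MAJOR = 3
    BLOCKER = 4


@dataclass(frozen=True)
class Finding:
    reviewer: str
    severity: Severity
    affects_correctness: bool  # correctness or stated requirements
    summary: str


def arbitrate(findings: list[Finding],
              threshold: Severity = Severity.MAJOR) -> list[Finding]:
    """Keep only findings that affect correctness and meet the threshold."""
    blocking = [f for f in findings
                if f.affects_correctness and f.severity >= threshold]
    return sorted(blocking, key=lambda f: f.severity, reverse=True)


findings = [
    Finding("security", Severity.BLOCKER, True, "SQL built by string concatenation"),
    Finding("frontend", Severity.NIT, False, "prefer a shorter variable name"),
    Finding("standards", Severity.MAJOR, True, "required audit log is missing"),
    Finding("standards", Severity.MINOR, True, "edge case unlikely in practice"),
]
for finding in arbitrate(findings):
    print(finding.severity.name, finding.reviewer, finding.summary)
```

Four findings go in and two come out as blocking. Everything else stays on the record without holding up the merge.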
The configuration layer is where the leverage lives
I treat all of it like code: reviewed, pruned, version-controlled.
The project config file stays short, specific, and universally true. It loads into every session, so every line costs context on every turn. Keep it under ~200 lines. Lead with the exact test, build, lint, and run commands. Include only what the agent can't infer. The test for each line: would removing this cause a mistake? If not, cut it. If the agent keeps ignoring a rule, the file is too long and the rule is getting lost. Prune; don't add "IMPORTANT" louder.
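Because the config is treated like code, it can be linted like code. This is a small sketch of such a check, assuming the ~200-line budget above and treating any line containing "IMPORTANT" as a sign that pruning is overdue. The function name and warning wording are hypothetical.

```python
def audit_config(text: str, max_lines: int = 200) -> list[str]:
    """Return human-readable warnings for a project config file's text."""
    warnings = []
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) > max_lines:
        warnings.append(f"{len(lines)} non-blank lines; budget is {max_lines}")
    shouted = sum(1 for line in lines if "IMPORTANT" in line)
    if shouted:
        warnings.append(f"{shouted} line(s) shout IMPORTANT; prune instead")
    return warnings


# Illustrative config text, not a real project's file.
sample = "Test: pytest -q\nIMPORTANT: never edit generated files\n"
print(audit_config(sample))
```

A check like this only catches size and shouting. Whether removing a line would cause a mistake still takes human judgment.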
Skills are custom logic loaded on demand. Each distinct rule or workflow becomes its own skill that loads only when relevant, so it doesn't bloat every conversation. Domain rules, API conventions, error-handling patterns, the data-contract format. I also encode current library usage as skills rather than letting the agent guess from stale training data. Many MCP servers I used to rely on turned out to be replaceable by a well-written skill: cheaper and more controllable.
Subagents are specialized workers in isolated context. They run with their own window, tool allowlist, and system prompt. I use them for two things: keeping heavy work out of my main context, and getting unbiased review. My roster includes a frontend specialist carrying our design standards, a security reviewer, an enterprise-standards reviewer that also arbitrates findings, and a systematic debugger.
What I don't do is route all implementation through subagents. That backfired. They work blind to the main conversation, so coupled work comes back as a lossy summary of code I never saw, which is exactly when I lose the thread. So I'm surgical: investigation and review go to subagents, genuinely independent sub-tasks fan out in parallel, but coupled implementation stays in the main session where I can interrupt and redirect.
Three lanes, picked by decision count
Nothing of substance happens on main. But "of substance" is the operative phrase.
Small (a one-sentence diff). Typo, log line, rename. Branch, make the change, run the check, commit. No plan, no spec, no reviewer gauntlet. Most tasks live here, and forcing them through the heavy lane is how the heavy lane loses credibility.
Standard (a few files, one clear approach). Plan mode, a short plan I edit, implement in the main session, verify, PR. Skip the formal spec and the fan-out; keep review proportional.
Large (multi-file, uncertain approach, real blast radius). The full pipeline: spec, locked data contracts, controlled fan-out, and multi-gate arbitrated review. That machinery is substantial enough to deserve its own piece, so I'm holding it for a follow-up rather than compressing it here into something misleading.
The honest test for which lane: if I can describe the finished diff in one sentence, it's Small. If I'm unsure of the approach or it touches several files, it's at least Standard. If it crosses module boundaries or I'd want a spec to point back to, it's Large.
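The lane test reads naturally as a small decision function. The sketch below is one reading of it, with hypothetical parameter names. It checks the Large conditions first because the Standard rule says "at least", and it leaves "several files" to the caller's judgment rather than hard-coding a number.

```python
from enum import Enum


class Lane(Enum):
    SMALL = "small"
    STANDARD = "standard"
    LARGE = "large"


def pick_lane(one_sentence_diff: bool,
              unsure_of_approach: bool,
              touches_several_files: bool,
              crosses_module_boundaries: bool,
              wants_spec: bool) -> Lane:
    """Pick the lightest lane that still covers the task's risk."""
    if crosses_module_boundaries or wants_spec:
        return Lane.LARGE
    if unsure_of_approach or touches_several_files:
        return Lane.STANDARD
    if one_sentence_diff:
        return Lane.SMALL
    # Not obviously small and not obviously large: default to the middle lane.
    return Lane.STANDARD


# A typo fix: describable in one sentence, nothing uncertain.
print(pick_lane(True, False, False, False, False).value)
```

Defaulting to Standard when nothing matches is an assumption of this sketch. The text only says that uncertainty pushes a task up a lane, never down.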
Debugging: root cause before fix
When something breaks, the worst move is letting the agent change code to see what moves. That's shotgun debugging, and it burns sessions while polluting context with failed attempts. Every bug goes to a dedicated systematic-debugger subagent with one non-negotiable rule: it never touches production code until it has a proven, documented root cause. The urge to "just try a fix" is the signal the root cause hasn't been found yet.
The procedure runs in a fixed order. Reproduce the bug with a failing test. Trace the real code path instead of inferring it, and validate every assumption against evidence. Write up the root cause and the hypotheses that were discarded. Only then fix the root cause rather than the symptom, verify with actual command output, and add guards at each layer the bug passed through so the same class of bug can't recur silently. The safeguard against doom loops: after three failed attempts the debugger stops, writes up the suspected design flaw, and escalates. Three strikes usually means the problem is architectural, not local.
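The three-strikes safeguard amounts to a bounded loop with a forced exit. The sketch below shows the shape under stated assumptions: `verify_fix` stands in for "apply the fix for this root-cause hypothesis and rerun the failing test", and the report fields and escalation text are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # the three-strikes budget described above


@dataclass
class DebugReport:
    fixed: bool = False
    attempts: int = 0
    discarded_hypotheses: list[str] = field(default_factory=list)
    escalation: Optional[str] = None


def debug_with_budget(hypotheses: list[str],
                      verify_fix: Callable[[str], bool]) -> DebugReport:
    """Try at most MAX_ATTEMPTS hypotheses, then stop and escalate."""
    report = DebugReport()
    for hypothesis in hypotheses[:MAX_ATTEMPTS]:
        report.attempts += 1
        if verify_fix(hypothesis):  # reproducing test now passes
            report.fixed = True
            return report
        report.discarded_hypotheses.append(hypothesis)
    report.escalation = "suspected design flaw: stop and escalate"
    return report


# Stand-in verifier that never succeeds, to show the escalation path.
stuck = debug_with_budget(["cache", "race", "config", "dns"], lambda h: False)
print(stuck.attempts, stuck.escalation)
```

The fourth hypothesis is never tried. The budget is what turns an open-ended hunt into a decision point about the design.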
Session hygiene and a setup that learns
The discipline that keeps the whole thing fast: /clear between unrelated tasks. The "kitchen sink" session is the most common way to ruin output quality. The two-correction rule: if I've corrected the agent more than twice on the same issue, the context is polluted and pulling it back toward failed approaches. I clear and restart with a sharper prompt that bakes in what I learned. A clean session almost always beats a long one full of corrections.
Memory is what makes multi-session work continuous, and I keep two layers distinct. Durable memory (rules and knowledge universally true for the project) lives in config and skills. Episodic memory (decisions and reasoning from specific sessions) lives in a semantic store. Confusing the two is how config bloats and how lessons get lost.
Reflection closes the loop. When something recurs, I run a pass that proposes minimal changes: a new rule only if its absence caused a real mistake, a skill for load-bearing knowledge, a hook for anything that must happen every time. The key move is promotion: a lesson that's proven universally true graduates from episodic memory into permanent config. Reflection is to the setup what code review is to the code.
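The promotion rule can be sketched as a routing function over recorded lessons. Everything here is hypothetical scaffolding: the `Lesson` fields, the two-session recurrence threshold, and the target labels are illustrative, not a description of any real memory store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lesson:
    text: str
    sessions_seen: int                 # how many sessions it recurred in
    caused_real_mistake: bool = False  # did its absence cause a real mistake?
    must_happen_every_time: bool = False
    is_load_bearing_knowledge: bool = False


def promotion_target(lesson: Lesson, min_sessions: int = 2) -> Optional[str]:
    """Decide where a recurring lesson graduates to, or None to leave it episodic."""
    if lesson.sessions_seen < min_sessions:
        return None  # not recurring yet; stays in episodic memory
    if lesson.must_happen_every_time:
        return "hook"
    if lesson.is_load_bearing_knowledge:
        return "skill"
    if lesson.caused_real_mistake:
        return "config rule"
    return None


lesson = Lesson("run the formatter after every edit", sessions_seen=3,
                must_happen_every_time=True)
print(promotion_target(lesson))
```

The ordering encodes the preference stated above: deterministic enforcement first, on-demand knowledge second, and an always-loaded config rule only when its absence has already caused a mistake.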
The one-paragraph version
Keep main clean and work on branches. Match the weight to the task. Keep the working context small: lean config, knowledge in on-demand skills, subagents for read-heavy investigation and review rather than for all implementation. Use the three enforcement layers correctly: config advises, permissions decide what's allowed, hooks force what must always happen. Plan and spec before code when there are enough decisions to get wrong. Give every task a check, never call it done until the check's output proves it, and never let the writer be the only grader. Lock data contracts before parallelizing. Send every bug to the systematic debugger and find the root cause before touching code. Promote recurring lessons into permanent config so the setup improves every session. And whenever you're unsure, ask instead of guessing.
The throughline, for anyone responsible for engineering quality at scale: the agent is capable, but capability without structure doesn't survive contact with production. The structure is the job.
Next in the series: Part 2: Shipping a Feature Without Losing the Thread. The full Large-task pipeline: spec, locked data contracts, controlled fan-out, arbitrated multi-gate review, and bounded autonomous runs. Part 3: Into the Brownfield. What changes when you point the agent at a large codebase you didn't write: you map before you build, and you inherit conventions instead of setting them.