SUBJECT: Why your coding agent keeps failing

PREVIEW: The model isn't the problem. Here's the architecture failure nobody talks about.

TITLE: The reason most coding agents fail isn't the model — it's the architecture
Most engineers blame the model when their coding agent fails.
Wrong model. Wrong temperature. Not enough reasoning tokens. So they upgrade from Sonnet to Opus. They switch from GPT-4o to Claude. They add more examples to the system prompt. And it still fails — just slightly differently.
The model is not the bottleneck. I know this because I've watched agents fail consistently across four different frontier models while running identical architectures. GPT-4o, Claude Sonnet, Gemini 2.5 Flash, Llama 70B. The failure modes were identical. Context starvation. Undirected loops. Deterministic gates replaced by model judgment calls. The architecture was broken — and a smarter model running inside a broken architecture is just a smarter agent making the same structural mistakes faster.
This is a deep dive into how coding agents actually fail — and what the architecture looks like when they don't.
The Real Failure Taxonomy
There are four ways a coding agent fails. None of them are about model quality.
| Failure Mode | Symptom | Root Cause |
|---|---|---|
| Context Starvation | Agent edits wrong files, misses dependencies | No codebase hydration before execution |
| Non-deterministic Gates | Tests pass in agent's mind, fail in CI | LLM decides whether to lint, not the system |
| Unconstrained Loops | Agent spins, burns tokens, produces nothing | No iteration ceiling, no state |
| Tool Poison | Agent gets lost mid-task | Tools return too much context, poison the window |
Engineers see the symptom — the agent edited the wrong function, or wrote code that doesn't even run — and conclude the model didn't understand. The model understood fine. The system gave it nothing to understand from.
Context Starvation: Agents That Run Blind
Every time you ask a coding agent to fix a bug or implement a feature, it starts with an empty working memory.
That's the problem.
A senior engineer who gets a ticket doesn't open an empty editor. They open the codebase. They look at the related files, the README, recent commits. They search for the function being modified and read what calls it. They check if there's a test for it. Only then do they start typing.
A naive coding agent gets: "Fix the auth middleware to use JWT instead of session tokens."
That's it. No context. No file structure. No knowledge of what auth_middleware.py imports, what it returns, what tests exist for it. The agent is not stupid — it's blind.
Here's the context hydration approach that actually works, from the Minion system running on this machine:
```python
# minion_context.py — 3-layer context hydration
def build_context(issue_title: str, issue_body: str, project_dir: str) -> str:
    """
    Layer 1: Vector store — past decisions, feedback, similar fixes (ChromaDB)
    Layer 2: Codebase — file structure, README, related files (filesystem search)
    Layer 3: Library docs — detected dependencies flagged for Context7 lookup
    """
    memories = _hydrate_memory(issue_title, issue_body, limit=5)
    codebase = _hydrate_codebase(project_dir, issue_title, issue_body)
    deps = _hydrate_deps(project_dir)
    return _render_context_md(memories, codebase, deps)
```
The cap is MAX_CONTEXT_CHARS = 16_000 — roughly 4,000 tokens. That's non-trivial but not catastrophic. It runs before the LLM touches anything. The agent's first read is CONTEXT.md, not a blank file.
The quality difference is not subtle. Without hydration, agents make structural mistakes — wrong import paths, missing dependencies, editing the wrong layer of a feature. With hydration, the same model navigates the codebase correctly on the first pass.
The implication: before you tune the model, measure how much of the codebase you're actually giving it. Zero context in → zero quality out, regardless of model.
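The article shows the 16k cap but not how it's enforced. A minimal sketch of one way to do it — `assemble_context` and the equal-share split between layers are my assumptions, not Minion's actual implementation:

```python
MAX_CONTEXT_CHARS = 16_000  # the cap named in the article, ~4k tokens

def assemble_context(layers: dict[str, str], cap: int = MAX_CONTEXT_CHARS) -> str:
    """Give each layer an equal share of the character budget, truncating overflow.

    Hypothetical sketch: a real system might weight layers by relevance instead
    of splitting the budget evenly.
    """
    per_layer = cap // max(len(layers), 1)
    sections = []
    for name, body in layers.items():
        sections.append(f"## {name}\n{body[:per_layer]}")
    # Final clamp guarantees the cap holds even with section headers added.
    return "\n\n".join(sections)[:cap]
```

The point of the clamp is that the budget is enforced by the system, not negotiated by the model — the same principle as the gates below.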
The Non-Deterministic Gate Problem
This is the failure mode that kills agent pipelines in production.
When a coding agent writes code, someone — or something — has to verify it. Does the code run? Do the tests pass? Is it formatted correctly? In a naive architecture, that verification is delegated back to the LLM. The agent writes code, then reasons about whether the code is correct.
That is the same as asking a student to grade their own exam.
LLMs are optimistic about their own output. They will tell you the tests pass when they haven't been run. They will tell you there are no linting errors when the formatter hasn't been called. This isn't hallucination in the scary sense — it's the model doing the best it can with no external signal.
The fix is architectural: deterministic gates run unconditionally, before the LLM evaluates anything.
```python
# minion_blueprints.py — Blueprint flow for a developer agent
developer_blueprint = Blueprint(nodes=[
    Node("branch",    NodeType.DETERMINISTIC, NodeAction.GIT_BRANCH),
    Node("hydrate",   NodeType.DETERMINISTIC, NodeAction.CONTEXT_BUILD),
    Node("implement", NodeType.AGENTIC,       NodeAction.IMPLEMENT),   # LLM writes code
    Node("format",    NodeType.DETERMINISTIC, NodeAction.FORMAT),      # ruff format .
    Node("lint",      NodeType.DETERMINISTIC, NodeAction.LINT),        # ruff check .
    Node("test",      NodeType.DETERMINISTIC, NodeAction.TEST),        # pytest -v
    Node("evaluate",  NodeType.DETERMINISTIC, NodeAction.EVALUATE),    # did gate pass?
    Node("commit",    NodeType.DETERMINISTIC, NodeAction.GIT_COMMIT),
])
```
Read the blueprint. The LLM has exactly one agentic node: implement. Everything else — format, lint, test, commit — runs deterministically. The system calls ruff format . and checks the exit code. It doesn't ask Claude whether the code is formatted.
```python
# minion_gates.py — Project-specific deterministic gates
PROJECT_GATES = {
    "python": {
        "lint":   ["ruff", "check", "."],
        "test":   ["python3", "-m", "pytest", "-v", "--tb=short"],
        "format": ["ruff", "format", "."],
    },
    "node": {
        "lint":   ["npx", "eslint", "."],
        "test":   ["npm", "test"],
        "format": ["npx", "prettier", "--write", "."],
    },
    "go": {
        "lint":   ["go", "vet", "./..."],
        "test":   ["go", "test", "./..."],
        "format": ["gofmt", "-w", "."],
    },
}
```
Each gate returns {ok: bool, output: str}. If lint returns ok: False, the blueprint routes to a FIX node — another agentic call, but now with the actual linting error as input. Not the agent's guess about what the error might be. The real stderr output.
The implication: an LLM reasoning about whether its own code is correct is not a verification step. It's a confidence interval of one. Deterministic gates are the only real verification. Design them first, add agents second.
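A gate in that shape is a few lines of subprocess code. This is a sketch — `run_gate` is my name, not necessarily Minion's — but the key property is visible: pass/fail comes from an exit code, never from model judgment:

```python
import subprocess

def run_gate(cmd: list[str], cwd: str = ".") -> dict:
    """Run one gate command; the exit code decides ok, not the LLM."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return {
        "ok": proc.returncode == 0,
        "output": proc.stdout + proc.stderr,  # real stderr feeds the FIX node
    }
```

Feed `PROJECT_GATES["python"]["lint"]` into this and the answer to "is the code clean?" becomes a fact about the world, not an opinion in the transcript.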
Unconstrained Loops: How Agents Eat Themselves
Give an agent a failing test and no iteration ceiling, and it will try to fix it forever.
This sounds obvious. In practice, almost nobody enforces it architecturally. The agent gets a test failure, rewrites the function, runs the test again in its imagination, decides it passes, and reports success. Except it didn't run the test. And if it does run the test and it fails again, it rewrites the function again. And again.
The Minion runner enforces a hard ceiling:
```python
MAX_RUN_TIME = 1800  # 30 minutes, wall clock

# Blueprint logic — max 2 iterations per agentic node
Node("implement", NodeType.AGENTIC, NodeAction.IMPLEMENT),
# → if evaluate fails:
Node("fix",      NodeType.AGENTIC,       NodeAction.FIX),   # one retry
Node("lint",     NodeType.DETERMINISTIC, NodeAction.LINT),
Node("test",     NodeType.DETERMINISTIC, NodeAction.TEST),
Node("evaluate", NodeType.DETERMINISTIC, NodeAction.EVALUATE),
# → if still fails: abort, report, do not loop again
```
Two iterations. Implement → Evaluate → Fix → Evaluate → Done. If the agent can't fix a failing test in two passes with the actual error message in hand, the issue is not the model or the prompt. It's the task decomposition upstream.
That last point is important. Most agent loops are a symptom of the task being too large. A well-decomposed task — "add a validate_token function to auth/middleware.py that returns a bool given a JWT string" — has a clear success condition and a bounded scope. A poorly decomposed task — "fix the auth system" — has neither, and the agent will spin until it hits the context limit or the timeout.
The implication: bounded execution is an architectural decision, not a model capability. Set the ceiling before you run the agent. If it fails twice, the task needs decomposition — not more tokens.
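The whole ceiling fits in a dozen lines. A sketch, with `run_with_ceiling` as a hypothetical name for the runner's control flow — implement once, fix at most once, then escalate:

```python
def run_with_ceiling(implement, evaluate, fix, max_fixes: int = 1) -> dict:
    """One implement pass, at most max_fixes fix passes, then abort to a human."""
    result = implement()
    if evaluate(result):
        return {"status": "pass", "result": result}
    for _ in range(max_fixes):
        result = fix(result)  # fix receives the real gate output, not a guess
        if evaluate(result):
            return {"status": "pass", "result": result}
    return {"status": "abort", "result": result}  # escalate; never loop again
```

Note that `evaluate` here is a deterministic gate from the previous section, so "pass" is grounded in an exit code rather than the model's self-assessment.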
Tool Poison: When Helpful Tools Become Hostile
Tool design is the least discussed and most impactful part of agent architecture.
When a coding agent has access to a tool that searches the codebase, the natural instinct is to return as much context as possible. The agent asked — give it everything. But an agent that calls a code search tool and gets 40 files of output has poisoned its own context window. It now has to reason about 40 files. It loses track of the original task. It starts hallucinating connections between unrelated parts of the codebase.
The right tool design principle: tools should return minimum viable context, not maximum available context.
Compare:
```python
# Bad — unbounded tool return
def search_codebase(query: str) -> str:
    results = grep_recursive(query, root_dir="/repo")
    return "\n".join(results)  # could be 100+ files

# Good — bounded tool return with relevance filter
def search_codebase(query: str, limit: int = 5) -> list[dict]:
    results = semantic_search(query, top_k=limit)
    return [{"path": r.path, "snippet": r.text[:500]} for r in results]
```
The bounded version doesn't just reduce tokens — it forces the tool to rank. Ranking means the agent gets the most relevant 5 files, not all 40. The agent makes better decisions with less information, consistently.
There's a parallel in prompt structure. Anthropic's long-context prompting guidance recommends placing reference data at the top of the prompt and the query at the bottom; retrieval is better than with the reverse ordering. The Minion context file follows this exactly:
```markdown
# CONTEXT.md (agent reads this before writing a single line)

## Relevant Past Decisions (from memory store)
[5 semantically matched memories]

## Codebase Structure
[file tree, limited to 3 levels deep]

## Related Files
[top 5 files matched to issue keywords]

## Detected Dependencies
[libraries to check docs for]

## Task
[the GitHub issue, at the bottom]
```
Data first. Query last. Always. Not because it's stylistically tidy — because the attention mechanism retrieves better when context grounds the query rather than the other way around.
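The assembly that produces that ordering can be sketched in a few lines. The function name echoes the `_render_context_md` referenced in `minion_context.py`, but the body here is my assumption:

```python
def render_context_md(memories: str, codebase: str, related: str,
                      deps: str, task: str) -> str:
    """Assemble CONTEXT.md data-first: the task (the query) goes last."""
    sections = [
        ("Relevant Past Decisions", memories),
        ("Codebase Structure", codebase),
        ("Related Files", related),
        ("Detected Dependencies", deps),
        ("Task", task),  # the query sits at the bottom, grounded by everything above
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)
```

Keeping the ordering in one function means no prompt author can accidentally invert it.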
The Shared Context Window as Mutable State
Here's the failure mode that's hardest to see until you're inside it.
In a multi-turn agent loop, the context window is shared, mutable state. Every tool call, every intermediate result, every assistant message — all of it accumulates. By turn 12, the agent is reasoning about its own earlier reasoning, which was based on partially incorrect tool output from turn 4.
This is not a model problem. A human reading the same transcript would make the same mistakes.
The architectural solution is to treat context like a write-once log, not a working document.
Structured agent systems enforce this with compaction policies. The OpenClaw gateway — which hosts 6 agents on this machine — uses a reserveTokensFloor: 40000 compaction threshold. When the context approaches that ceiling, older exchanges get summarized and compressed. The agent's effective working memory is the last N turns plus the compacted summary, not the full unedited transcript.
```json
{
  "agents": {
    "defaults": {
      "compaction": {
        "reserveTokensFloor": 40000
      }
    }
  }
}
```
Forty thousand tokens reserved. Not for the agent's response — for the context the agent needs to reason correctly. The compaction fires before that floor is hit, ensuring the agent never enters a turn with less than 40k tokens of working space.
Without this, long agent sessions degrade. The model hasn't changed. The task hasn't changed. The context has become unmanageable.
The implication: design the context management strategy before you design the agent's tools or prompts. It's the load-bearing wall.
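The trigger logic behind a reserve floor is simple enough to sketch. This is not OpenClaw's implementation — the function name, the 200k window, and the string-truncation "summary" are all stand-ins (a real system would summarize with an LLM call):

```python
def maybe_compact(turns: list[str], count_tokens, window: int = 200_000,
                  floor: int = 40_000, keep_recent: int = 6) -> list[str]:
    """Compact older turns once remaining headroom drops below the reserve floor."""
    used = sum(count_tokens(t) for t in turns)
    if window - used >= floor:
        return turns  # enough working space left; leave history alone
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # Stand-in for an LLM-generated summary of the compacted turns.
    summary = "[compacted] " + " | ".join(t[:40] for t in old)
    return [summary] + recent
```

The invariant is the point: the check runs before every turn, so the agent never starts reasoning with less than the floor's worth of free context.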
Where This Falls Short
The blueprint model works well for well-defined, bounded tasks. It struggles at the edges.
The two-iteration ceiling is conservative by design — it catches runaway loops at the cost of occasionally giving up on fixable problems. A test that fails due to a transient environment issue, not a code issue, hits the ceiling and gets flagged. That's a false negative: the code was correct but the system declared failure. In practice, this is preferable to the alternative (infinite retry), but it means the human still reviews a subset of failed runs that aren't actually broken code.
The context hydration is semantic, not structural. It finds related files by keyword and vector similarity, which means it misses dependency chains that aren't semantically obvious. An agent asked to modify a cache layer might never see the cache invalidation logic if it lives in a module with a completely different name — unless the hydration specifically searched for callers of the cache function. This requires tuning the hydration queries per project, which is manual overhead.
And the deterministic gates are only as good as the test suite they enforce. An agent with 40% test coverage will still pass its gates while shipping broken code. Deterministic gates enforce the rules you've already written. They don't write the rules for you.
The Architecture That Actually Works
Here's the structure, end to end.
```
Ticket → Context Hydration → CONTEXT.md
        ↓
Agent Node (LLM): implement
        ↓
Gate: format → lint → test → evaluate
        ↓ (if fail)
Agent Node (LLM): fix  [with real error output]
        ↓
Gate: lint → test → evaluate
        ↓ (if fail)
Abort → report → human review
        ↓ (if pass)
Gate: commit → push → create_pr
```
No agent decides whether to lint. No agent decides whether tests pass. No agent loops more than twice without human review. The model does exactly one thing: write code given grounded context. Everything around it is deterministic.
The model doesn't need to be smarter. The architecture needs to not be stupid.
The Real Bottleneck
There's a mental model from Goldratt's Theory of Constraints that applies exactly here: a system is only as fast as its slowest part. Optimizing a non-bottleneck is waste.
The AI community spends enormous energy optimizing the model — the one part of the system that frontier labs are already optimizing on your behalf, with billions of dollars and thousands of researchers. Meanwhile the context hydration is an afterthought, the gates don't exist, the loops are unconstrained, and the tools return 10x more context than the agent can use.
The bottleneck isn't the model. Fix the bottleneck.
What to Do Monday
Audit your agent's architecture against four questions:
- Context: What does your agent know about the codebase before it writes the first line? If the answer is "whatever's in the prompt," you have context starvation.
- Gates: When your agent says tests pass, did the tests actually run? If a subprocess didn't execute and check the exit code, the gate doesn't exist.
- Iterations: What happens on the third retry? If the answer is "it keeps retrying," you have an unconstrained loop.
- Tools: What's the largest possible output from any single tool call? If you don't know, you have a tool poison risk.
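For the tools question, the cheapest structural fix is a wrapper that makes the answer knowable. A sketch — `bounded_tool` is a hypothetical decorator, and the 4,000-character default is an assumption, not a recommendation from any framework:

```python
import functools

def bounded_tool(max_chars: int = 4_000):
    """Wrap a tool so oversized returns are truncated and flagged,
    never silently dumped into the context window."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = str(fn(*args, **kwargs))
            if len(out) > max_chars:
                return out[:max_chars] + f"\n[truncated {len(out) - max_chars} chars]"
            return out
        return wrapper
    return deco
```

Wrapping every tool this way turns "what's the largest possible output?" from an unknown into a configured constant.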
Fix one of these this week. Not the model. Not the prompt. The architecture.
Next post: the context window as a data structure — how to design agent memory so it doesn't degrade over a 10-turn session. That one's paid.
If this changed how you're thinking about your agent stack, paid subscribers get this depth on one specific system per week — implementation-level, not conceptual.
Subscribe — $8/month or $80/year (save 2 months)
Got a coding agent failure mode I didn't cover? Reply and tell me what broke — the more specific, the better.