April 2026
I Built an AI Agent That Writes All My Production Code

I'm Mike. I've been writing software for 25+ years. For the last year, an AI agent called AM has been writing all of my production code. Not autocomplete. Not copilot suggestions I accept or reject. Fully autonomous task execution — I give it a ticket, it reads the codebase, writes the code, runs the tests, commits, and moves on.
AM built my entire portfolio site — augmentedmike.com. Every page, every component, every deployment. I direct strategy and make architecture decisions. AM executes. That's the arrangement.
This post is about the engineering that makes it reliable. Not the prompting tricks — the actual architecture decisions that determine whether an autonomous agent ships working code or produces expensive garbage.
The agent is stateless. That's the whole trick.
Every agent framework I evaluated tries to maintain state in the model's context window. They carry conversation history, accumulate knowledge across turns, build up a mental model of the project. It sounds smart. It doesn't work.
Context windows drift. The model "forgets" instructions from 10,000 tokens ago. Hallucinations compound — one wrong assumption becomes the foundation for the next ten decisions. And when something goes wrong, you can't audit what the model "knew" at the time because the state was inside the model, not in files you can read.
AM is stateless. Every run is one-shot: read state from files, do one unit of work, write state back, exit. The agent carries nothing between invocations. Zero. The files are the source of truth.
This sounds like a limitation. It's the key to the whole system. If a run fails, you run it again — it reads the same files and tries again from a clean state. No accumulated errors. No context window corruption. No debugging "what was the model thinking." The files tell you everything.
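A minimal sketch of what one stateless run looks like. The file name (state.json) and the task-queue shape are hypothetical, not AM's actual format; the point is the lifecycle: read files, do one unit of work, write files, exit.

```python
import json
from pathlib import Path

def run_once(state_dir: Path) -> dict:
    """One stateless invocation: read state, do one unit of work, write state back, exit."""
    state_file = state_dir / "state.json"
    state = json.loads(state_file.read_text()) if state_file.exists() else {"done": [], "queue": []}

    if state["queue"]:
        task = state["queue"].pop(0)
        # ... do exactly one unit of work for `task` here ...
        state["done"].append(task)

    # Files are the source of truth; nothing survives in the process.
    state_file.write_text(json.dumps(state, indent=2))
    return state
```

If a run crashes before the write, the files are unchanged and the next invocation simply retries the same unit of work from a clean state.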
Three-tier memory (because stateless doesn't mean amnesic)
A stateless agent still needs to know things. It just can't store them inside the model. So I built three external memory systems, each modeled on how human memory actually works.

Short-term memory is a folder of markdown files at workspaces/memory/st/. Every agent reads every file in this folder before it starts work. No search. No retrieval step. Whatever's in st/ is always injected, always applied. This is where unconditional rules live — "never use git stash," "the database schema changed, use the new column name," "the client hates blue buttons." Small, fast, always on. Like working memory in your brain — the stuff you hold in your head right now.
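The unconditional-injection behavior is simple enough to sketch. Assuming the files in st/ are plain markdown, loading short-term memory is just a concatenation, with no retrieval step:

```python
from pathlib import Path

def load_short_term(st_dir: Path) -> str:
    """Concatenate every markdown file in st/ -- no search, no ranking, always injected."""
    parts = []
    for f in sorted(st_dir.glob("*.md")):
        parts.append(f"## {f.name}\n{f.read_text().strip()}")
    return "\n\n".join(parts)
```

Because there is no retrieval logic, nothing in st/ can ever be "missed" by a bad query; the cost is that the folder has to stay small.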
Long-term memory is a SQLite database with FTS5 (full-text search) at workspaces/memory/lt/memory.db. When the agent needs context, it runs memory recall "query" and gets ranked results. Research notes, project-specific context, lessons from past failures. The agent has to ask for it — it doesn't inject automatically. This keeps context clean and focused.
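A sketch of the recall mechanism, assuming a standard SQLite FTS5 virtual table (the table and column names here are illustrative, not AM's schema). FTS5's built-in bm25 ranking is exposed through the special `rank` column:

```python
import sqlite3

def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the long-term memory store as an FTS5 virtual table."""
    db = sqlite3.connect(path)
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS notes USING fts5(topic, body)")
    return db

def remember(db: sqlite3.Connection, topic: str, body: str) -> None:
    db.execute("INSERT INTO notes VALUES (?, ?)", (topic, body))

def recall(db: sqlite3.Connection, query: str, limit: int = 5) -> list[tuple[str, str]]:
    """Ranked full-text search; the agent must ask, nothing is injected automatically."""
    return db.execute(
        "SELECT topic, body FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

The pull-based design is the point: long-term memory only enters context when a query retrieves it, so stale notes can't silently pollute every run.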
Episodic memory is the git history. Every iteration produces a log file at iter/*/agent.log. The commit message timestamps what happened. An agent can read any past iteration's log to understand what was tried, what failed, and what was learned. When a task ships, all the iteration commits squash into one clean commit — the noisy step-by-step record compresses into a single fact: "this task was completed."
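Reading the episodic record back is just filesystem access over the iter/ layout described above (the directory shape here is an assumption based on the iter/*/agent.log path):

```python
from pathlib import Path

def list_episodes(repo: Path) -> list[str]:
    """List past iterations that left a log behind, oldest first."""
    return sorted(p.name for p in (repo / "iter").iterdir() if (p / "agent.log").exists())

def read_episode(repo: Path, iteration: str) -> str:
    """Read one iteration's log: what was tried, what failed, what was learned."""
    return (repo / "iter" / iteration / "agent.log").read_text()
```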
Gated verification (the agent can't lie about being done)
Tasks move through four states: backlog → in-progress → in-review → shipped. Each transition has a gate enforced by code.
A task can't leave backlog without criteria.md written. It can't leave in-progress until every criterion has an implementation. It can't ship until every acceptance criterion is verified against the actual output and the tests pass.
The gates are deterministic. The agent doesn't self-report "I think this works." The system checks. If verification fails, the task goes back to in-progress with a log entry explaining what failed. This is the difference between an agent that produces working code and one that produces confident-sounding garbage.
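The gate logic can be sketched as a small deterministic state machine. The marker files used here (implementation.done, tests.passed) are hypothetical stand-ins for whatever artifacts the real gates check; criteria.md is from the post:

```python
from pathlib import Path

STATES = ["backlog", "in-progress", "in-review", "shipped"]

def gate_ok(task_dir: Path, target: str) -> bool:
    """Deterministic checks enforced by code, not by the agent's self-report."""
    if target == "in-progress":
        return (task_dir / "criteria.md").exists()            # can't leave backlog without criteria
    if target == "in-review":
        return (task_dir / "implementation.done").exists()    # every criterion implemented
    if target == "shipped":
        return (task_dir / "tests.passed").exists()           # verified against actual output
    return False

def advance(task_dir: Path, current: str) -> str:
    """Try to move a task forward; a failed gate bounces it back to in-progress."""
    target = STATES[STATES.index(current) + 1]
    if not gate_ok(task_dir, target):
        return "backlog" if current == "backlog" else "in-progress"
    return target
```

Nothing in this loop asks the model whether it is done; the only way forward is for the required artifacts to actually exist.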
Worktree isolation (multiple agents, zero conflicts)
Each task gets its own git worktree. The agent works in an isolated copy of the repo. When the task ships, the worktree squash-merges into the integration branch. Clean linear history.
This means I can run multiple agents on different tasks simultaneously. Agent A is building a new feature while Agent B is fixing a bug in a different part of the codebase. They can't step on each other because they're in separate worktrees. When both ship, the merges are clean because each one is a self-contained diff.
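The per-task lifecycle maps directly onto stock git commands. A sketch, assuming one branch per task named task/&lt;id&gt; and a sibling worktree directory (both naming conventions are illustrative, not AM's actual layout):

```python
import subprocess
from pathlib import Path

def start_task(repo: Path, task_id: str) -> Path:
    """Give the task its own branch and its own isolated worktree."""
    wt = repo.parent / f"wt-{task_id}"
    subprocess.run(
        ["git", "worktree", "add", "-b", f"task/{task_id}", str(wt)],
        cwd=repo, check=True,
    )
    return wt

def ship_task(repo: Path, task_id: str, message: str) -> None:
    """Squash-merge the task branch into the integration branch, then clean up."""
    subprocess.run(["git", "merge", "--squash", f"task/{task_id}"], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo, check=True)
    subprocess.run(["git", "worktree", "remove", str(repo.parent / f"wt-{task_id}")],
                   cwd=repo, check=True)
```

Each agent gets its own working directory and its own branch, so two tasks can't write to the same checkout, and the squash-merge keeps the integration branch linear.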
The startup serialization is one of the weirder implementation details — I had to add a lock to prevent concurrent OAuth token refresh races when multiple Claude CLI processes start at the same time. Four seconds of hold time per startup, then they run fully concurrent. Small thing, but it took a day to debug.
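One way to sketch that serialization, using a Unix advisory file lock (the lock path and the launch callback are hypothetical; the real fix may differ in mechanism, but the shape is the same: exclusive lock, brief hold, then release into full concurrency):

```python
import fcntl
import time
from pathlib import Path

LOCK_PATH = Path("/tmp/agent-startup.lock")  # hypothetical lock file

def serialized_startup(launch, hold_seconds: float = 4.0):
    """Hold an exclusive lock across startup so concurrent processes
    can't race on the shared OAuth token refresh."""
    with LOCK_PATH.open("w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until it's this process's turn
        result = launch()                   # token refresh happens inside startup
        time.sleep(hold_seconds)            # give the refresh time to settle on disk
        fcntl.flock(lock, fcntl.LOCK_UN)    # from here on, processes run fully concurrent
    return result
```

Only startup is serialized; once each process has its token, the lock is released and the agents proceed in parallel.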
ClaimHawk — where the agent meets the real world
The other project that pushed this architecture was ClaimHawk — a system that automates dental insurance claims. This is where AI stops being about text generation and starts being about interacting with legacy software that was never designed for automation.
Insurance carrier portals don't have APIs. They have websites from 2008 with session timeouts and CAPTCHAs. I tried Playwright with CSS selectors for months. Every portal redesign broke everything. Then I switched to computer vision — the system reads the screen semantically, like a human would. When the portal changes its layout, the vision model adapts without code changes. The maintenance cost went from "constant firefighting" to "near zero."
The models are fine-tuned Qwen3 running on-prem, not cloud APIs. Dental claims involve patient health data — HIPAA means that data can't hit someone else's servers. Open-weight models trained on real claim data, running on hardware the practice controls. The tradeoff is you own the fine-tuning pipeline. The upside is you own the capability permanently.
What actually matters (and what doesn't)
After a year of running autonomous agents in production, the stuff that matters isn't what I expected.
Prompting barely matters. The architecture matters. How you structure state, how you verify output, how you handle failure — that's 90% of whether the agent produces working code. Prompt engineering is a rounding error compared to having deterministic verification gates.
Statelessness is non-negotiable. Every system I've seen that tries to maintain state in the model eventually produces a bug that nobody can explain because nobody can audit what the model "knew." Files are auditable. Model state is not.
Local models outperform on domain tasks. A fine-tuned Qwen3 on dental claims data beats GPT-4 on dental claims processing. Not close. The general model is smarter overall but doesn't know what a CDT code is. The fine-tuned model is less capable generally but knows the domain cold.
Vision beats scraping. If you're automating interaction with any UI that you don't control, computer vision is more resilient than DOM-based approaches. Higher initial investment, dramatically lower maintenance.
The system is open-source. You can see it at helloam.bot. Happy to answer questions about any of it.
I build these kinds of systems for clients too — autonomous AI pipelines, custom automation, production ML. If your business has a process that's eating time, I can probably automate it.