Why I Benched My AI Agent
Foreword
Well then, this is a story about the one time I allowed my AI agent to co-write a complex library, and what I learned from the experience.
The following post is my subjective experience, and part of the reason I published independent research papers on AI limitations. If you want the formal derivation, head over to Zenodo and read my papers: https://zenodo.org/search?q=metadata.creators.person_or_org.name%3A%22Aswin%2C%20Alwyn%22
Fair warning: those papers are math-heavy.
If you just want a more distilled take on AI development, I've also written about the subject on my blog: https://arpeggio.one/content/blogs
One Job: Glue the State Machines Together
Let's talk about my experience having AI write a stateful core like Dumbo, where the real engineering happens, and how I had to take over and kick it, agents and all, to the curb.
It started off fine. DTOs got generated, some were hand-built. I declared intent.
I even built the state machines.
I gave AI one job: glue my state machines together. Thousands of lines of code later, nothing was happening and invariants weren't observed. I got methods as wide as Niagara: float, int, float, int, str, str. It's almost musical. My class looks like a music sheet. No, make it a music tape, like those 1800s piano rolls you'd see in the Wild West. Completely unreadable to humans, and apparently to machines too.
Token burn went up, but nothing got done.
I shoved the AI aside to show it who's boss, and how it was done in the good old days.
The Niagara Phase
My 745 lines of code are now 88 lines. I can reason about it. I know it has only the things I need, no fluff, no extra. SOLID. Especially the S and the O: Single Responsibility; Open to extension, closed to modification.
When I had AI do the job, concerns were smeared all over classes, encapsulation weakened, and when asked to make portions functional, aka delegates, it balked. Everything turned imperative. Method A calls Method B calls Method C, except Method C has 10 arguments, so Method B has at least 10 arguments, and Method A has at least as many as Method B. A mess. It also didn't work.
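To make the cascade concrete, here's a minimal sketch in Python with hypothetical names of my own (this is not the actual library code): every method in the chain drags the full argument list along just to forward it, versus bundling the shared state once and passing a single context object down.

```python
from dataclasses import dataclass

# The anti-pattern: Method C's signature leaks all the way up the chain.
def method_c(rate: float, retries: int, gain: float, limit: int,
             host: str, mode: str) -> str:
    return f"{host}:{mode} rate={rate * gain} retries={min(retries, limit)}"

def method_b(rate: float, retries: int, gain: float, limit: int,
             host: str, mode: str) -> str:
    # B exists only to forward A's arguments to C.
    return method_c(rate, retries, gain, limit, host, mode)

def method_a(rate: float, retries: int, gain: float, limit: int,
             host: str, mode: str) -> str:
    return method_b(rate, retries, gain, limit, host, mode)

# The fix: name the state once; the chain collapses to one narrow call.
@dataclass(frozen=True)
class TransferConfig:
    rate: float
    retries: int
    gain: float
    limit: int
    host: str
    mode: str

def transfer(cfg: TransferConfig) -> str:
    return (f"{cfg.host}:{cfg.mode} rate={cfg.rate * cfg.gain} "
            f"retries={min(cfg.retries, cfg.limit)}")
```

Same behavior, but changing one parameter now touches one dataclass instead of three signatures.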
Makes me wonder about all the companies touting AI writing all the code, and about chasing lines of code written as a metric. Quality shouldn't be measured by LoC, but by something closer to Kolmogorov complexity: the shortest piece of code that does the job; maybe the best measure is the code not written. AI is failing that badly, and the bill is coming due. Anthropic will make sure of that.
You know why I know? Because once I let AI touch my code, once I let it smear meaning across abstractions, I no longer wanted to touch the code. So I prompted harder, and it bounced. Now to fix it, it needs the old code, the reason why I was displeased, and the refactor of the refactor.
The Big Rip
Sure, you can build guardrails around this, but remember what I said: the best code is the code that didn't have to be written. Humans love rediscovering the old adage that prevention is better than cure. You can make the code right, right away, by writing it yourself. Or you can tell AI to go fix it and pay the Codex or Claude tax after it's written 10,000,000 lines of code. Prolific isn't a good word when it comes to agents.
There are two theories for how the universe will end: the Big Crunch and the Big Rip. At our current trajectory, all signs point to the latter.
I still use AI, but if you want to know how I actually use AI without landing in the mess I just described, you can find the framework at my shop: Structured Delegation: Governing AI in Long-Lived Codebases. It follows my principles. Small, lean, effective.
Structured Delegation: Governing AI in Long-Lived Codebases https://arpeggio.one/shop/?focus=structured-delegation-ai