general 2026-03-21 · Updated 2026-03-21

Why I Benched My AI Agent


Foreword

Well then: this is the story of the one time I let my AI agent co-write a complex library, and what I learned from the experience.

The following post is my subjective experience, and part of the reason I published independent research papers on AI limitations. If you want the formal derivation, head over to Zenodo and read my theses: https://zenodo.org/search?q=metadata.creators.person_or_org.name%3A%22Aswin%2C%20Alwyn%22

Fair warning: those papers are math-heavy.

If you just want a more distilled take on AI development, I’ve also written about the subject on my blog: https://arpeggio.one/content/blogs

One Job: Glue the State Machines Together

Let’s talk about my experience having AI write a stateful core like Dumbo, where the real engineering happens, and how I had to take over and kick it, and the agents too, to the curb.

It started off fine. DTOs got generated, some were hand-built. I declared intent.

I even built the state machines.

I gave AI one job: glue my state machines together. Thousands of lines of code later, nothing was happening and invariants weren’t being observed. I got methods as wide as Niagara: float, int, float, int, str, str. It’s almost musical. My class looks like a music sheet. No, make it a music roll, like those 1800s player-piano rolls you’d see in the Wild West. Completely unreadable to humans, and apparently to machines too.
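To make the "Niagara" complaint concrete, here is a minimal sketch of what such a signature looks like next to the obvious fix. All names here are hypothetical, invented for illustration; this is not the actual Dumbo code.

```python
from dataclasses import dataclass

# The piano-roll signature (hypothetical): a positional run of
# primitives that no caller can read at a glance.
def transition(state: str, tempo: float, beats: int, gain: float,
               channel: int, label: str, mode: str) -> str:
    return f"{label}:{state}->{mode}"

# Grouping the related primitives into one value object shrinks
# every signature that has to thread them through.
@dataclass(frozen=True)
class TransitionEvent:
    state: str
    tempo: float
    beats: int
    gain: float
    channel: int
    label: str
    mode: str

def transition_v2(event: TransitionEvent) -> str:
    return f"{event.label}:{event.state}->{event.mode}"
```

Same behavior, but the second version names its data once and keeps the call sites legible.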

Token burn went up, but nothing got done.

I shoved the AI aside to show it who’s boss, and how it was done in the good old days.

The Niagara Phase


My 745 lines of code are now 88 lines. I can reason about it. I know it has only the things I need, no fluff, no extra. SOLID. Especially the SO part: Single Responsibility, and Open for extension, closed for modification.
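The "open for extension, closed for modification" shape of a state-machine glue layer can be sketched like this. This is an illustrative pattern under assumed names, not the actual library code: the dispatcher never changes; new transitions extend a table.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[dict], str]

# Transition table: (state, event) -> handler. Extending the machine
# means adding entries here, not editing the dispatcher below.
TRANSITIONS: Dict[Tuple[str, str], Handler] = {}

def on(state: str, event: str):
    """Decorator that registers a handler for a (state, event) pair."""
    def register(fn: Handler) -> Handler:
        TRANSITIONS[(state, event)] = fn
        return fn
    return register

def dispatch(state: str, event: str, ctx: dict) -> str:
    """Closed to modification: this function is done, forever."""
    return TRANSITIONS[(state, event)](ctx)

@on("idle", "start")
def start(ctx: dict) -> str:
    return "running"
```

Adding a new transition is one decorated function; `dispatch` stays untouched, which is the single-responsibility, open-closed point.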

When I had AI do the job, concerns were smeared all over classes, encapsulation weakened, and when asked to make portions functional, aka delegates, it balked. Everything turned imperative. Method A calls Method B calls Method C, except Method C has 10 arguments, so Method B has at least 10 arguments, and Method A has at least as many as Method B. A mess. It also didn’t work.
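The argument-inflation chain described above, and the delegate version the agent balked at, can be sketched side by side. Names and arities are hypothetical, chosen only to show the shape of the problem.

```python
from typing import Callable

# The imperative smear: C needs everything, so B must carry
# everything, so A must carry everything too.
def method_c(a, b, c, d, e, f, g, h, i, j):
    return a + b + c + d + e + f + g + h + i + j

def method_b(a, b, c, d, e, f, g, h, i, j):
    return method_c(a, b, c, d, e, f, g, h, i, j)

def method_a(a, b, c, d, e, f, g, h, i, j):
    return method_b(a, b, c, d, e, f, g, h, i, j)

# The delegate version: bind the data once, then pass a single
# zero-argument callable down instead of ten loose primitives.
def make_step(*values: int) -> Callable[[], int]:
    return lambda: sum(values)

def run(step: Callable[[], int]) -> int:
    return step()
```

Both compute the same thing, but in the second version only one function ever sees the raw arguments; the rest of the chain stays narrow.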

Makes me wonder about all the companies touting AI writing all their code, and about chasing lines of code written as a metric. Maybe quality shouldn’t be measured by LoC at all, but by something closer to Kolmogorov complexity: the shortest piece of code that does the job. AI is failing that badly, and the bill is coming due. Anthropic will make sure of that.

You know how I know? Because once I let AI touch my code, once I let it smear meaning across abstractions, I no longer wanted to touch the code. So I prompted harder, and it bounced. Now, to fix it, it needs the old code, the reason I was displeased, and the refactor of the refactor.

The Big Rip

Sure, you can build guardrails around this, but remember what I said: the best code is the code that didn’t have to be written. Humans love rediscovering the old adage that prevention is better than cure. You can make the code right, right away, by writing it yourself. Or you can tell AI to go fix it and pay the Codex or Claude tax after it’s written 10,000,000 lines of code. Prolific isn’t a good word when it comes to agents.

There are two theories for how the universe will end: the Big Crunch and the Big Rip. At our current trajectory, all the signs point to the latter as the more likely end.

I still use AI, but if you want to know how I actually use AI without landing in the mess I just described, you can find the framework at my shop: Structured Delegation: Governing AI in Long-Lived Codebases. It follows my principles. Small, lean, effective.

Structured Delegation: Governing AI in Long-Lived Codebases https://arpeggio.one/shop/?focus=structured-delegation-ai