A field guide to AI coding agents
What AI coding agents actually are, how they differ, and a practical way to think about handing real work to them.
"AI coding agent" has become one of those phrases that means five different things depending on who's saying it. For one person it's the autocomplete that finishes their line. For another it's a tool that takes a ticket, writes the code, and opens the pull request while they're at lunch. Those are not the same thing, and the difference changes how you work.
This is a map. Not a ranking, not a list of what's hot this month — the categories, what actually changes as you move between them, and how to hand real work to these tools without getting burned. It's deliberately vendor-neutral: the products change monthly, the shape of the space doesn't.
The spectrum
Most coding tools sit on a spectrum from "predicts what you'd type" to "does the task for you." Picture it as a ladder: four rungs are worth naming.
Autocomplete. It completes the line or block you're already writing. You're driving; it's guessing the next few tokens. It lives inside your editor, optimized for low latency so it never breaks your flow. The value is real but narrow: it speeds up the keystrokes you were going to make anyway.
Chat. A panel you ask. It can see your file or selection, explain an error, draft a function, refactor a snippet. The work comes back as text or a suggested edit, and you decide what to apply. You're still the one integrating it into the codebase.
Agentic CLIs and IDE agents. You give the tool a goal, and it reads files, runs commands, edits across the repo, checks the result, and tries again. It operates in a loop with access to your tools. You review the diff instead of writing it. Claude Code, Aider, and the agent modes built into modern editors are examples of this rung. Names aside, the defining trait is that the tool acts.
Background agents. You hand off a whole task — an issue, a ticket — and it works asynchronously, often remotely, then comes back with a pull request. The unit of delegation is no longer a turn in a conversation; it's an entire piece of work.
The line that matters runs between the second rung and the third. Below it, the tool predicts or answers and you do the integrating. Above it, the tool acts and verifies, and your job shifts to review. Everything under the line makes typing faster. Everything over it changes what you delegate.
What "agentic" really buys you
The word "agentic" gets sprayed on everything, so it's worth stripping down. Three things turn a chatbot into an agent.
Tools. It can do things, not just say them — read files, run the test suite, grep the codebase, execute code, call an API. A model with tools can observe reality instead of narrating a guess about it.
A loop. It acts, sees what happened, decides the next step, and repeats. The loop is where the leverage lives: run the test, read the failure, fix it, run it again — the exact cycle you'd otherwise drive by hand.
Verification. The loop is only as good as the agent's ability to check itself. An agent that can run your tests, type-checker, or linter has a ground truth to move toward. One that can't is just generating confident text and hoping. The most telling question about whether an agent will ship working code is this: can it tell, on its own, whether it succeeded?
So "agentic" isn't a measure of intelligence. It's whether the feedback loop is closed. That's also why the same model feels brilliant in one repo and useless in another — give it tools and fast, trustworthy checks and it converges; take those away and it flails.
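To make the three ingredients concrete, here is a minimal Python sketch of a closed loop. It illustrates the idea under stated assumptions and does not describe how any particular product works: `propose` stands in for a model call, `check` stands in for running tests or a linter, and every name in it (`run_agent_loop`, `CheckResult`, `ToyProposer`, `toy_check`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CheckResult:
    passed: bool
    feedback: str  # what the agent gets to read on the next turn


def run_agent_loop(
    propose: Callable[[str], str],
    check: Callable[[str], CheckResult],
    goal: str,
    max_steps: int = 5,
) -> Tuple[Optional[str], int]:
    """Act, verify, observe, repeat until the check passes or the budget runs out."""
    feedback = goal
    for step in range(1, max_steps + 1):
        candidate = propose(feedback)   # act (stand-in for a model call)
        result = check(candidate)       # verify against ground truth
        if result.passed:
            return candidate, step
        feedback = result.feedback      # observe, then go around again
    return None, max_steps


class ToyProposer:
    """Hypothetical stand-in for a model: nudges an integer guess using feedback."""

    def __init__(self) -> None:
        self.guess = 0

    def __call__(self, feedback: str) -> str:
        if feedback == "too low":
            self.guess += 1
        elif feedback == "too high":
            self.guess -= 1
        return str(self.guess)


def toy_check(candidate: str) -> CheckResult:
    """Hypothetical stand-in for a test suite: passes only when the guess is 3."""
    value = int(candidate)
    if value == 3:
        return CheckResult(True, "ok")
    return CheckResult(False, "too low" if value < 3 else "too high")


result = run_agent_loop(ToyProposer(), toy_check, goal="find the target")
print(result)  # ('3', 4)
```

The toy proposer starts at 0 and converges on its fourth step because each failure tells it which way to move. Swap in a check that returns empty feedback and the guess never changes, which is the "flails" case in miniature.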
Where they shine, and where they bite
The capability that makes these tools powerful is also what makes them fail in predictable ways.
They shine on work that's scoped and verifiable: make this failing test pass, port this module to a new API, add a field end-to-end, write tests against a spec, grind through boilerplate that has a clear pattern to copy. Anything with a fast, automated check the agent can run itself is squarely in the strike zone.
They bite when the goal is fuzzy or the check is missing. Hand an agent a vague task with no definition of done and it will do something, confidently — and you'll spend longer reviewing the result than you'd have spent doing it yourself. The same goes for work where success is slow or manual to judge: subtle UX, performance with no benchmark, "make it feel right." And a large change with weak tests is the genuinely dangerous case — the agent moves fast, and fast without a net is how you ship a confident mess.
The skill that separates good results from bad ones is scoping: shrink the task until success is checkable. A good prompt is mostly a good definition of done.
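One way to practice this is to write the task down as data before writing the prompt. The sketch below is a hypothetical helper, not part of any agent tool: `TaskSpec`, its fields, and the example file paths are all assumptions made for illustration. It flags a task that has no automated check or no bounded scope, and renders the rest as a prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskSpec:
    goal: str
    check_command: str = ""  # automated check the agent can run itself
    files_in_scope: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return the reasons this task is not yet safe to hand off."""
        issues = []
        if not self.check_command.strip():
            issues.append("no automated check: success is not verifiable")
        if not self.files_in_scope:
            issues.append("no files in scope: the task is unbounded")
        return issues

    def to_prompt(self) -> str:
        """Render the spec as a short prompt with an explicit definition of done."""
        lines = [f"Goal: {self.goal}"]
        if self.files_in_scope:
            lines.append("Files in scope: " + ", ".join(self.files_in_scope))
        if self.check_command:
            lines.append(f"Done when this exits 0: {self.check_command}")
        return "\n".join(lines)


vague = TaskSpec(goal="Make the settings page feel right")
scoped = TaskSpec(
    goal="Make test_parse_date pass",
    check_command="pytest tests/test_dates.py",  # hypothetical test file
    files_in_scope=["src/dates.py"],             # hypothetical module
)

print(vague.problems())
print(scoped.to_prompt())
```

The vague task reports both problems; the scoped one reports none, and its prompt ends with a command the agent can run to know it is finished.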
A workflow that compounds
Keeping a human in the loop is easy to get wrong in two opposite directions: rubber-stamping diffs you never really read, or babysitting every keystroke until the tool is slower than doing it yourself. The path between them is a handful of habits.
Work in small, reviewable units. One task, one branch, a diff you can actually hold in your head. Large autonomous changes are where trust quietly breaks.
Invest in the checks once. Tests, types, lint, a single green-gate command — every check you automate is leverage the agent reuses on every future task. The repo that's pleasant for an agent is the same one that's pleasant for a human: clear structure, fast tests, honest error messages.
Review the diff, not the conversation. The artifact is the code. Read it the way you'd read a pull request from a fast junior who never tires and never remembers yesterday.
Keep context tight. Point the tool at the right files instead of dumping the whole repo at it. Good boundaries — small files, clear interfaces — help an agent for exactly the same reason they help you.
The reason this compounds: each habit is an investment that pays out on every task after it. Better tests produce better agent output, which makes review faster, which lets you delegate more. The flywheel is real — but you have to build it. The same discipline runs through the open-source AI stack we'd actually build on: the boring parts, done well, are what make the impressive parts possible.
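The "single green-gate command" from the habits above could be as small as the following Python sketch. The check list is an assumption: it supposes a Python repo that uses ruff, mypy, and pytest, so swap in whatever your project actually runs. `run_gate`, `main`, and `CHECKS` are hypothetical names.

```python
import subprocess
import sys
from typing import Sequence, Tuple

# Assumed check list for a Python repo; replace with your project's own commands.
CHECKS = [
    ("lint", ["ruff", "check", "."]),
    ("types", ["mypy", "."]),
    ("tests", ["pytest", "-q"]),
]


def run_gate(checks: Sequence[Tuple[str, Sequence[str]]]) -> Tuple[bool, str]:
    """Run each check in order and stop at the first failure."""
    for name, cmd in checks:
        try:
            proc = subprocess.run(list(cmd), capture_output=True, text=True)
        except FileNotFoundError:
            return False, f"{name}: command not found: {cmd[0]}"
        if proc.returncode != 0:
            # Return the tool's own output so a human or an agent can act on it.
            return False, f"{name} failed\n{proc.stdout}{proc.stderr}"
    return True, "all checks passed"


def main() -> int:
    """Entry point: print the result and return a shell-style exit code."""
    ok, message = run_gate(CHECKS)
    print(message)
    return 0 if ok else 1
```

Saved as a script that ends with `sys.exit(main())`, this gives both you and the agent one command whose exit code answers "is it green?". Stopping at the first failure keeps the feedback short, which suits a loop that will rerun the gate after every fix.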
The cache
A few things worth keeping, the way we keep everything else here — examples, not endorsements.
- Pick one agentic CLI and learn it deeply. Claude Code, Aider, and the OpenAI Codex CLI are all reasonable starting points. Fluency in one beats a shallow tour of all three.
- Read the workflow writing, skip the hype. The most useful material on agents is about how to work — scoping, review, evals — not benchmark victory laps. The engineering posts from the model labs and a few practitioners clear that bar; most launch threads don't.
- Watch the verification story. When a new tool shows up, the first question worth asking isn't how smart it is — it's how it checks its own work. That answer tells you more than any demo.
The tools will keep changing, and some of the names above will age. The shape won't. Give a capable model tools, a loop, and a way to check itself, then hand it work whose success you can verify. Everything else is detail.
That's the whole field guide. Now go scope something small and hand it off.