How autonomous AI engineers actually work (and where they fail)

A practical look at the loop behind AI coding agents: issue pickup, planning, edits, tests, retries, PRs, handoff, and the failure modes teams should expect.

Most autonomous AI engineer demos skip the part that decides whether the tool works in a real repo. The demo shows a clean ticket, a clean branch, and a PR that passes on the first try. Real use starts later, when tests are flaky, dependencies are old, the issue is vague, and the model chooses the wrong file before it chooses the right one.

An agent such as Codowave is less mysterious when you look at the loop. It claims an issue, studies the repo, edits files, runs checks, retries failures, opens a PR, and hands the work back to a human or to CI. The model matters, but the loop around the model decides most of the outcome: strong scaffolding tends to produce better PRs whichever Claude variant is behind the inference call.

The loop in seven steps

1. Pickup
2. Plan
3. Edit
4. Test
5. Iterate
6. Open PR
7. Hand off

Every coding agent uses some version of this sequence. The differences are in what context the agent reads, what tools it can call, how it treats failures, and when it stops.
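As a rough illustration, the seven steps can be sketched as a small state walk. This is a minimal sketch, not Codowave's implementation: the names `Step`, `RunState`, and `run_loop` are hypothetical, and the choice to skip the PR when the attempt budget runs out is an assumption made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Step(Enum):
    PICKUP = "pickup"
    PLAN = "plan"
    EDIT = "edit"
    TEST = "test"
    ITERATE = "iterate"
    OPEN_PR = "open_pr"
    HAND_OFF = "hand_off"


@dataclass
class RunState:
    issue: str
    attempts: int = 0
    tests_pass: bool = False
    history: List[Step] = field(default_factory=list)


def run_loop(state: RunState,
             run_tests: Callable[[int], bool],
             max_attempts: int = 3) -> RunState:
    """Walk the seven steps. ITERATE sends the run back to EDIT until
    tests pass or the attempt budget is spent."""
    state.history += [Step.PICKUP, Step.PLAN]
    while True:
        state.attempts += 1
        state.history += [Step.EDIT, Step.TEST]
        state.tests_pass = run_tests(state.attempts)
        if state.tests_pass or state.attempts >= max_attempts:
            break
        state.history.append(Step.ITERATE)
    # Assumption for this sketch: only a passing run opens a PR.
    # A failing run goes straight back to a human.
    if state.tests_pass:
        state.history.append(Step.OPEN_PR)
    state.history.append(Step.HAND_OFF)
    return state
```

Calling `run_loop(RunState("fix login bug"), run_tests=lambda attempt: attempt == 2)` walks edit and test twice, with one iterate step in between, before opening the PR.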

Pickup

The agent starts by claiming work from a queue. In Codowave, that queue is made of tracker tickets marked ready for work. Scheduled scanners can add issues too, for example dependency cleanup or stale documentation.

The common failure at pickup is ambiguity. "Refactor auth" is not a task. It is a topic. A good agent asks for clarification, splits the work, or returns the issue to a human. A bad agent writes a large PR and hopes review will sort it out.
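One way to picture that triage is a small gate in front of the queue. The sketch below is hypothetical: the `Issue` type, the `triage` function, and the keyword heuristics are illustrative assumptions, not how Codowave decides. It only shows the shape of the decision: work on it, split it, or ask.

```python
import re
from dataclasses import dataclass


@dataclass
class Issue:
    title: str
    body: str = ""


def triage(issue: Issue, max_tasks: int = 1) -> str:
    """Return 'work', 'split', or 'clarify' for an incoming issue.

    Heuristics (assumptions for illustration only):
    - more than `max_tasks` unchecked checklist items -> 'split'
    - no statement of expected behavior in the body   -> 'clarify'
    """
    tasks = re.findall(r"^\s*[-*] \[ \] .+", issue.body, flags=re.MULTILINE)
    has_criteria = bool(
        re.search(r"acceptance criteria|expected|should", issue.body, re.IGNORECASE)
    )
    if len(tasks) > max_tasks:
        return "split"
    if not has_criteria:
        return "clarify"
    return "work"
```

With these rules, `triage(Issue("Refactor auth"))` returns `'clarify'`, because the issue names a topic and says nothing about the expected result.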

Plan

Planning is where the agent earns the right to edit. It needs enough repo context to know which files matter and which files should be left alone. Reading too little leads to shallow changes. Reading too much burns tokens and distracts the model.

Codowave does a static analysis pass before the worker model sees the issue. The system indexes symbols, file relationships, package boundaries, and likely entry points. The planner then selects a narrow set of files for the task. That leaves more context budget for the actual change.
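To make the idea concrete, here is a minimal sketch of symbol indexing and file selection using Python's standard `ast` module. It covers only top-level Python definitions and a naive name match against the issue text; file relationships, package boundaries, and entry points are left out. `index_symbols` and `select_files` are hypothetical names, and a real indexer would be language-aware and far richer.

```python
import ast
import re
from collections import defaultdict
from typing import Dict, List, Set


def index_symbols(sources: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each top-level function or class name to the files defining it.

    `sources` maps a file path to its Python source text.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for path, code in sources.items():
        for node in ast.parse(code).body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                index[node.name].add(path)
    return index


def select_files(issue_text: str, index: Dict[str, Set[str]], limit: int = 5) -> List[str]:
    """Rank files by how many of their symbols the issue text mentions."""
    words = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", issue_text))
    scores: Dict[str, int] = defaultdict(int)
    for symbol, paths in index.items():
        if symbol in words:
            for path in paths:
                scores[path] += 1
    # Highest score first; ties broken by path for stable output.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:limit]
```

The `limit` parameter is the point of the exercise: a short, ranked file list keeps the context budget for the change itself.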

Planning also sets scope. If the issue is too large, Codowave can split it with auto-decomposition before code is written. That prevents one broad request from turning into one risky PR.

Edit

Editing is the part people picture first, but it is rarely the hardest part. When the plan is good, the model has a clear job: change these files, preserve unrelated behavior, and stay inside the issue.

The main edit failure is drift. The agent fixes the bug and also formats a neighboring file. It renames a helper nobody asked it to touch. It rewrites an error message that a metric depends on. Codowave treats those changes as scope violations. The diff should match the plan, not the model's taste.
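A scope check of this kind can be as simple as comparing two sets of paths. The sketch below assumes the plan yields a set of file paths and the diff yields another. `ScopeReport` and `check_scope` are hypothetical names, and a real check would likely look at hunks, not just file names.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ScopeReport:
    unplanned: List[str]   # files changed that the plan never mentioned
    untouched: List[str]   # planned files the diff did not change

    @property
    def ok(self) -> bool:
        # Only unplanned changes count as violations; an untouched
        # planned file may simply have needed no edit.
        return not self.unplanned


def check_scope(planned: Set[str], changed: Set[str]) -> ScopeReport:
    """Compare the files in the diff against the files in the plan."""
    return ScopeReport(
        unplanned=sorted(changed - planned),
        untouched=sorted(planned - changed),
    )
```

In practice the `changed` set could come from the output of `git diff --name-only` against the base branch.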

Test

Testing is where weak agents break. Real repos have local setup steps, pinned Node versions, missing secrets, flaky tests, and package manager edge cases. A coding agent that cannot run checks is only guessing.

Codowave tries to mirror the repo's expected runtime. It reads files such as package.json, .tool-versions, .python-version, Dockerfiles, and CI YAML. Then it runs the checks available to the repo. When a check fails, the failure signature is stored for the current run so the agent does not repeat the same diagnosis.
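The post does not say how a failure signature is computed, so the following is an assumption: normalize the volatile parts of the output (addresses, numbers, whitespace), hash what remains, and keep a per-run map from signature to diagnosis. `failure_signature` and `RunMemory` are hypothetical names.

```python
import hashlib
import re
from typing import Dict, Optional


def failure_signature(output: str) -> str:
    """Hash a failure after stripping details that change between runs."""
    text = re.sub(r"0x[0-9a-fA-F]+", "<addr>", output)   # memory addresses
    text = re.sub(r"\d+(\.\d+)?", "<n>", text)           # timings, line numbers
    text = re.sub(r"\s+", " ", text).strip()             # whitespace noise
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


class RunMemory:
    """Remembers diagnoses for the current run only."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}

    def record(self, output: str, diagnosis: str) -> Optional[str]:
        """Return the earlier diagnosis if this failure was already seen;
        otherwise store the new diagnosis and return None."""
        sig = failure_signature(output)
        if sig in self._seen:
            return self._seen[sig]
        self._seen[sig] = diagnosis
        return None
```

Normalizing every number is a deliberate trade-off in this sketch: it merges failures that differ only by timing, at the cost of also merging failures that differ only by line number.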

Iterate

Retries separate useful agents from code generators. A failing test can mean the agent broke something, the test already failed on main, the environment is missing a secret, or the test flakes once every few runs.

The agent has to ask a boring question first: does this fail without my change? If yes, the PR description should say so and the agent should avoid patching unrelated failures. If no, the failure belongs to the branch and the agent should fix it before review.
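That question translates into a short decision procedure. The sketch below is one possible ordering, not a documented Codowave algorithm: `Verdict` and `classify` are hypothetical names, and the two callables stand in for "run this check on my branch" and "run it on the base branch". Each returns True when the check passes.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    PASSING = "passing"
    PRE_EXISTING = "pre_existing"          # also fails without the change
    FLAKY = "flaky"                        # fails, then passes on a rerun
    CAUSED_BY_BRANCH = "caused_by_branch"  # fails only with the change


def classify(run_on_branch: Callable[[], bool],
             run_on_base: Callable[[], bool],
             reruns: int = 2) -> Verdict:
    """Decide who owns a failing check."""
    if run_on_branch():
        return Verdict.PASSING
    # The boring question first: does it fail without my change?
    if not run_on_base():
        return Verdict.PRE_EXISTING
    # Base is green, so rerun on the branch to separate flakes from breaks.
    if any(run_on_branch() for _ in range(reruns)):
        return Verdict.FLAKY
    return Verdict.CAUSED_BY_BRANCH
```

A `PRE_EXISTING` verdict belongs in the PR description; a `CAUSED_BY_BRANCH` verdict belongs back in the edit step. A flake that only appears on the branch can still be the branch's fault, so `FLAKY` deserves a note for the reviewer rather than silence.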

Retry limits matter too. Infinite retries waste money and make the final diff worse. Codowave stops after a bounded number of attempts and reports what it tried.
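A bounded retry loop with a report is easy to sketch. The limit of three attempts, the `try_fix` callback, and the `RetryReport` type are all assumptions for illustration; the post only says the number of attempts is bounded and that the agent reports what it tried.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Attempt:
    number: int
    error: Optional[str]  # None means the checks passed


@dataclass
class RetryReport:
    succeeded: bool
    attempts: List[Attempt] = field(default_factory=list)

    def summary(self) -> str:
        outcome = "passed" if self.succeeded else "gave up"
        lines = [f"{outcome} after {len(self.attempts)} attempt(s)"]
        for a in self.attempts:
            lines.append(f"  attempt {a.number}: {a.error or 'checks passed'}")
        return "\n".join(lines)


def run_with_budget(try_fix: Callable[[int], Optional[str]],
                    max_attempts: int = 3) -> RetryReport:
    """Call try_fix(attempt_number) until it returns None (success)
    or the attempt budget is spent. Every attempt is recorded."""
    report = RetryReport(succeeded=False)
    for number in range(1, max_attempts + 1):
        error = try_fix(number)
        report.attempts.append(Attempt(number, error))
        if error is None:
            report.succeeded = True
            break
    return report
```

The `summary()` text is what would end up in the PR description or the handoff note, so a reviewer can see each attempt instead of only the final diff.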

Open PR

The PR is the handoff artifact. A useful agent PR explains what changed, why the change was made, what files were touched, what checks ran, and what stayed out of scope.

That last part is underrated. Reviewers often reject agent work because they cannot tell whether missing work was forgotten or deliberately excluded. A short "not included" section makes review faster and keeps the issue boundary visible.
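A PR body with those sections can be produced by a small template function. This is a hypothetical sketch: `render_pr_body`, the section titles, and the markdown-style headings are assumptions, not Codowave's actual PR format.

```python
from typing import Dict, List


def render_pr_body(what: str,
                   why: str,
                   files: List[str],
                   checks: Dict[str, bool],
                   not_included: List[str]) -> str:
    """Build a PR description that always carries a 'Not included' section."""

    def bullets(items: List[str]) -> str:
        return "\n".join(f"- {item}" for item in items)

    check_lines = [
        f"{name}: {'passed' if ok else 'failed'}" for name, ok in checks.items()
    ]
    # Say so explicitly when nothing was excluded, so reviewers
    # never have to guess whether the section was forgotten.
    excluded = not_included or ["Nothing was deliberately left out."]
    parts = [
        "## What changed", what, "",
        "## Why", why, "",
        "## Files touched", bullets(files), "",
        "## Checks run", bullets(check_lines), "",
        "## Not included", bullets(excluded),
    ]
    return "\n".join(parts)
```

Keeping "Not included" mandatory, even when it is empty, is the design choice that matters here: the section's presence tells the reviewer the boundary was considered.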

Hand off

After the PR opens, the workflow returns to the team. New repos usually need human review. Mature repos can enable tighter automation once the team trusts the agent's behavior and CI coverage.

Handoff is also where the agent learns practical repo facts. If reviewers keep asking for the same convention, that convention should become memory or a repo rule. If reviewers keep rejecting broad PRs, the issues in the queue need tighter scope.
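Turning repeated review feedback into a rule can start as a simple count. The sketch below assumes review comments have already been reduced to short convention tags, which is the hard part and is not shown. `conventions_to_promote` and the threshold of three are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def conventions_to_promote(review_tags: Iterable[str], threshold: int = 3) -> List[str]:
    """Return convention tags that reviewers raised at least `threshold` times.

    `review_tags` is a flat stream of tags, one per review comment,
    for example 'use-structured-logging'.
    """
    counts = Counter(review_tags)
    return sorted(tag for tag, n in counts.items() if n >= threshold)
```

For example, three comments tagged `use-structured-logging` and one tagged `prefer-early-return` would promote only the first to a repo rule.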

The failure modes to expect

Most agent failures are ordinary engineering failures. The runtime does not match CI. A test needs a secret that is not in the repo. The issue describes three tasks as one. A dependency upgrade changes a type. A remembered file path is stale.

These failures are not glamorous, but they decide whether the product is useful. The model's ability to write a clever line of code matters less than the system's ability to install dependencies, run checks, notice bad scope, and hand reviewers a readable diff.

The best teams improve the loop from both sides. They give the agent smaller issues with clearer acceptance criteria. The agent gives them PRs with better tests, better descriptions, and explicit limits.

What autonomy buys

Autonomy does not mean no human judgment. It means the team can move work from "someone should eventually do this" to "a PR exists and needs review." That is a smaller, easier decision.

The gains show up in maintenance work first. Dependency bumps, doc updates, test fixes, lint cleanup, stale TODOs, and small product bugs often sit because starting them costs more attention than the work itself. An agent removes that startup cost. Humans still review the result.

For larger work, autonomy helps only when scope is controlled. A broad feature request still needs human design. A well-scoped issue can move through the loop while the team works on harder problems.

The takeaway

An autonomous AI engineer is a loop, not a magic model call. Pickup, planning, editing, testing, retries, PRs, and handoff all need guardrails. If those parts are weak, a better model only produces larger mistakes. If those parts are strong, the agent can turn real tracker tickets into reviewable PRs without a developer sitting beside it. See the weekend case study for what that looks like across 10 real backlog tickets, or compare the agent surface against Cursor, Devin, Sweep, and Cline to pick the right tool for the job.