How to read AI coding agent benchmarks without fooling yourself

SWE-bench scores sell tools but rarely predict how those tools will perform on your code. A practical guide to what coding-agent benchmarks measure, where they mislead, and the metric that matters.

Every coding-agent launch in 2026 comes with a number, and the number is almost always a SWE-bench score. It's a percentage, it's bigger than last quarter's, and it's printed in the largest font the landing page allows. The number is real, the benchmark is serious work, and the score still tells you surprisingly little about whether the tool will close your tickets. Learning to read these numbers — what they measure, where they quietly mislead, and which metric actually predicts production — is the difference between buying a benchmark and buying a tool.

We run these evaluations internally and we've been burned by misreading them, so this is a practical guide rather than a takedown. Benchmarks are useful. They're just useful for a narrower thing than the marketing implies.

What SWE-bench actually measures

SWE-bench is a set of real GitHub issues from open-source Python projects, each paired with the human pull request that resolved it and the tests that PR had to pass. An agent is handed the repository and the issue text, told to produce a patch, and scored on whether those tests go from red to green without breaking tests that already passed. SWE-bench Verified is a human-filtered subset where the problems are confirmed solvable and the tests confirmed fair, which is why most credible reports quote the Verified number.
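To make the grading rule concrete, here is a minimal sketch of how a single task might be scored. It assumes the two test groups the published dataset labels FAIL_TO_PASS and PASS_TO_PASS; the class and function names (TaskResult, is_resolved, resolve_rate) are ours, made up for illustration, and this is not the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes for one task after the agent's patch is applied.

    Hypothetical structure for illustration, not the official harness format.
    """
    fail_to_pass: dict[str, bool]  # tests the human fix turned green -> passing now?
    pass_to_pass: dict[str, bool]  # tests that were already green -> still passing?


def is_resolved(result: TaskResult) -> bool:
    """A task counts only if every target test passes and nothing regressed."""
    if not result.fail_to_pass:
        # No target tests means there is nothing to prove the fix worked.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved, i.e. the headline percentage."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Note that the rule is all-or-nothing per task: a patch that fixes the bug but breaks one previously passing test scores the same as no patch at all.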

That design has real virtues. The tasks are authentic, the grading is objective, and "did the tests pass" is a much harder thing to game than a model judging its own prose. When a tool moves from 40% to 70% on Verified, something genuine improved.

But notice what's baked in. The issues are public, which means they may sit in training data. They're Python. They come with a test suite that already encodes the right answer. And the agent gets one well-scoped problem at a time, in a repo the benchmark chose. Every one of those is different from your Tuesday.

The four ways a score misleads

Contamination. SWE-bench draws from public repositories, and frontier models trained after those issues were filed may have seen the fix. The benchmark authors work hard to mitigate this, but a score lifted partly by memorization won't transfer to your private codebase, which no model has ever seen.

The harness is half the result. A "SWE-bench score" is never just the model. It's the model plus the scaffolding around it: how the agent retrieves context, how it plans, how many times it retries, how it runs tests. The same model can swing twenty percentage points depending on the harness. When you compare two scores from two vendors, you're comparing two harnesses at least as much as two models, and the harness you'll actually run may be neither.

One language, one shape. SWE-bench is Python. If your work is a TypeScript monorepo, a Go service, or a Rails app with a decade of conventions, a Python score is a proxy for a proxy. Agents that shine on clean library code can flounder on a build system held together by institutional memory.

Pass@k inflation. Some reported numbers are pass@k — the agent gets k attempts and counts a win if any attempt passes. That's a legitimate measure of potential, but it's not how you experience the tool. You experience pass@1: the first PR it opens. A great pass@5 with a mediocre pass@1 means a lot of review churn for you.
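The gap between pass@1 and pass@k is easy to see with a little arithmetic. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n attempts were run on a task and c of them passed. Whether a given vendor computes its pass@k this way is an assumption on our part; the function name is ours.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts passes.

    n: attempts actually run on the task
    c: how many of those attempts passed
    k: the attempt budget being reported
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the probability that all k attempts drawn (without
    # replacement) from the n recorded attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 passes out of 10 attempts, pass_at_k(10, 3, 1) is about 0.30 while pass_at_k(10, 3, 5) is about 0.92. Same agent, same task, and the reported number triples just by changing k.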

A benchmark tells you what an agent can do on a problem someone else chose. It says almost nothing about what it will do on the problem in front of you.

The metric that actually predicts production

The number we watch internally isn't a public benchmark at all. It's merge rate on your own repository: of the PRs the agent opens against your code, what fraction get merged with no human edits. It's unglamorous, it's specific to you, and it's the only figure that survives contact with your conventions, your tests, your reviewers, and your definition of done.

Merge rate is honest in a way SWE-bench can't be. It resists contamination, because the issues are yours and, in a private codebase, no model has trained on the fixes. It can't hide behind a harness, because it's the harness you're running. And it measures pass@1 by construction: a PR that needed three rounds of review fixes wasn't merged clean, and merge rate counts it accordingly.
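As a rough sketch, merge rate as defined here is a single ratio over the PRs an agent opened. The Outcome labels and the merge_rate function are hypothetical names for illustration; how you pull these outcomes from your own PR history will depend on your tooling.

```python
from enum import Enum


class Outcome(Enum):
    """How an agent-opened PR ended up (labels are illustrative)."""
    MERGED_CLEAN = "merged_clean"        # merged with no human edits
    MERGED_AFTER_EDITS = "merged_edits"  # merged, but a human changed it first
    REJECTED = "rejected"                # closed without merging


def merge_rate(outcomes: list[Outcome]) -> float:
    """Fraction of agent-opened PRs merged with no human edits."""
    if not outcomes:
        raise ValueError("no PRs to score")
    clean = sum(o is Outcome.MERGED_CLEAN for o in outcomes)
    return clean / len(outcomes)
```

Only MERGED_CLEAN counts toward the numerator. A PR merged after edits still sits in the denominator, which is what keeps the figure a first-attempt measure.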

The catch is that you can't read it off a landing page. You have to generate it, which means running a real evaluation on a real slice of your backlog. That's exactly why a concrete free tier matters more than a benchmark chart: three real issues run against your repo tell you more than any percentage a vendor can print, because they produce your number, not theirs.

How to run an evaluation that isn't theater

If you're comparing agents seriously, a weekend of structured testing beats a month of reading benchmarks. A few rules keep it honest.

Pick ten issues you've already closed, so you have a human baseline to compare against. Mix the difficulty — a couple of one-line fixes, a few medium features, one genuinely hard change — because an agent's average hides its variance, and variance is what hurts in production.

Run each agent on the same ten issues, untouched, and score only the first attempt (pass@1), recording each PR as merged clean, merged after edits, or rejected. Resist the urge to coach. The whole point is to see what it does when you're not steering, because that's the mode you're buying.

Then count the part nobody benchmarks: review time. An agent that opens a plausible-looking PR you have to read for twenty minutes to trust is not obviously better than no agent. The win condition isn't a patch that passes tests; it's a patch you'd approve as fast as a teammate's.
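The rules above fit in a small scorecard. This is one possible layout, not a prescribed format: the Attempt fields, the difficulty labels, and the outcome strings are all hypothetical, and review time is whatever your reviewers self-report.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Attempt:
    """One agent's first attempt at one issue (field names are illustrative)."""
    agent: str
    issue: str
    difficulty: str        # e.g. "easy", "medium", "hard"
    outcome: str           # "clean", "edited", or "rejected"
    review_minutes: float  # time a human spent before trusting or rejecting it


def scorecard(attempts: list[Attempt]) -> dict[str, dict]:
    """Summarize each agent: clean rate, spread by difficulty, review cost."""
    by_agent: dict[str, list[Attempt]] = defaultdict(list)
    for a in attempts:
        by_agent[a.agent].append(a)

    report = {}
    for agent, rows in by_agent.items():
        # Group clean/not-clean flags by difficulty to expose the variance
        # that a single average would hide.
        by_difficulty: dict[str, list[bool]] = defaultdict(list)
        for r in rows:
            by_difficulty[r.difficulty].append(r.outcome == "clean")
        report[agent] = {
            "clean_rate": sum(r.outcome == "clean" for r in rows) / len(rows),
            "clean_rate_by_difficulty": {
                d: sum(flags) / len(flags) for d, flags in by_difficulty.items()
            },
            "median_review_minutes": median(r.review_minutes for r in rows),
        }
    return report
```

Reading the three numbers side by side is the point: two agents with the same clean rate can differ sharply on hard issues or on how long their PRs take to review.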

Where benchmarks are genuinely useful

None of this means ignore them. A SWE-bench Verified score is a fine coarse filter — it'll tell you which tools are in the conversation and which aren't, and a tool that scores near zero isn't worth your weekend. Trends over time within a single vendor's harness are meaningful, because the harness is held constant. And the benchmark community has pushed the whole field forward by making "did the tests pass on a real issue" the bar, which is a much better bar than the demos that preceded it.

Use the score to build the shortlist. Use your own merge rate to make the decision. The vendors with the highest numbers want you to do the first and skip the second, because the second is where their number meets your repo and the two don't always agree.

The takeaway

A benchmark is a map drawn by someone else, of terrain they chose, at a scale that flatters them. It's worth reading. It is not the territory. The territory is your repository, your conventions, and the PRs an agent opens against them — and the only number that measures the territory is the one you generate yourself. Build the shortlist from the charts. Buy the tool from the merge rate.


Frequently asked questions