Glossary Term

SWE-bench

By The Codegen Team · Updated March 26, 2026

A benchmark for evaluating AI coding agents against real GitHub issues, measuring whether agents can produce correct fixes.

SWE-bench is a benchmark for evaluating AI coding agents against real GitHub issues. It presents agents with actual bug reports and feature requests from open-source repositories and measures whether the agent can produce a correct fix.

SWE-bench Verified is a curated subset with human-validated solutions. Top scores currently exceed 70% on the verified set. The benchmark has become the standard evaluation for autonomous coding capabilities.

A key finding from SWE-bench research is that scaffolding matters as much as the model: in one test, three different agent frameworks running the same underlying model differed by 17 solved issues across 731 problems. The architecture around the model does real work.

In plain English

The standardized test for AI coding agents — it measures how often they can solve real, verified GitHub issues from actual open source projects.

Why it matters

Before SWE-bench, comparisons of AI coding tools were marketing demos and cherry-picked examples. SWE-bench tests agents against problems with known correct solutions, requiring working patches rather than plausible-looking code. It gives a claim like "72.5%" a concrete, reproducible meaning: the agent produced correct fixes for 72.5% of the verified test cases.

In practice

SWE-bench presents an agent with a real GitHub issue, for example a Django bug where filtering by a related field generates incorrect SQL. The agent reads the issue description, navigates the codebase, and produces a patch. The patch is then validated by running the repository's tests: the tests that reproduce the bug must now pass, and the existing tests must not regress. There is no partial credit; the patch either resolves the issue or it does not. The verified subset (SWE-bench Verified) uses 500 human-validated problems to eliminate ambiguous test cases.
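The pass/fail scoring described above can be sketched roughly as follows. This is a minimal illustration, not the real harness (which runs each instance in an isolated container); the `Instance` fields and function names are hypothetical, and pytest stands in for whatever test runner the target repository uses.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Instance:
    """One hypothetical SWE-bench problem: a repo snapshot plus its tests."""
    repo_dir: str
    patch: str                 # model-generated diff
    fail_to_pass: list[str]    # tests that must flip from failing to passing
    pass_to_pass: list[str]    # regression tests that must keep passing

def run_tests(repo_dir: str, tests: list[str]) -> bool:
    """Return True if every named test passes (pytest as a stand-in runner)."""
    result = subprocess.run(
        ["pytest", "-q", *tests], cwd=repo_dir, capture_output=True
    )
    return result.returncode == 0

def is_resolved(inst: Instance) -> bool:
    """Binary scoring: the patch must apply cleanly, fix the failing
    tests, and break nothing else. No partial credit."""
    applied = subprocess.run(
        ["git", "apply", "-"], cwd=inst.repo_dir,
        input=inst.patch.encode(), capture_output=True
    )
    if applied.returncode != 0:
        return False  # patch does not even apply
    return (run_tests(inst.repo_dir, inst.fail_to_pass)
            and run_tests(inst.repo_dir, inst.pass_to_pass))

def solve_rate(resolved: list[bool]) -> float:
    """Fraction of instances resolved, i.e. the headline benchmark score."""
    return sum(resolved) / len(resolved)
```

A run over the whole benchmark then reduces to `solve_rate([is_resolved(i) for i in instances])`, which is the single number that gets reported.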

How Codegen uses SWE-bench

Codegen runs on Claude, which holds one of the highest SWE-bench Verified scores among currently available models. That score is the floor: the baseline reasoning capability the agent brings to any task. Codegen determines what the agent is given to reason about: the task context, the relevant files, the acceptance criteria. A high benchmark score with poor task context produces mediocre output; a high benchmark score with well-structured ClickUp tasks produces consistently useful pull requests.

Frequently Asked Questions