SWE-bench is a benchmark for evaluating AI coding agents against real GitHub issues. It presents agents with actual bug reports and feature requests from open-source repositories and measures whether the agent's patch resolves the issue, as judged by the repository's own test suite.
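The pass/fail judgment can be sketched as follows. SWE-bench instances carry two test lists, FAIL_TO_PASS (tests that fail before the fix and should pass after) and PASS_TO_PASS (tests that must not regress); an instance counts as resolved only when every test in both lists passes after the patch is applied. The function name and the "PASSED" status string here are illustrative, not the harness's exact API:

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """Sketch of SWE-bench-style scoring.

    test_results: dict mapping test identifier -> status string
                  (here assumed to be "PASSED" / "FAILED").
    fail_to_pass: tests that demonstrated the bug and must now pass.
    pass_to_pass: tests that passed before and must still pass.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # Every required test must pass; a missing result counts as a failure.
    return all(test_results.get(t) == "PASSED" for t in required)


# A patch that fixes the bug without breaking existing behavior resolves
# the instance; fixing the bug while regressing another test does not.
fixed = {"tests/test_bug.py::test_repro": "PASSED",
         "tests/test_core.py::test_existing": "PASSED"}
regressed = {"tests/test_bug.py::test_repro": "PASSED",
             "tests/test_core.py::test_existing": "FAILED"}
```

The strictness matters: a patch is not credited for partially fixing the issue, and it is penalized for collateral breakage anywhere in the checked tests.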
SWE-bench Verified is a human-validated subset of 500 problems, screened by annotators to remove tasks with under-specified issue descriptions or unreliable tests. Top scores currently exceed 70% on the verified set, and the benchmark has become the standard evaluation for autonomous coding capabilities.
A key finding from SWE-bench research is that scaffolding matters as much as the model. In one test, three different agent frameworks running the same underlying model scored 17 issues apart on 731 problems. The architecture around the model does real work.
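Aggregate scores can hide how differently scaffolds behave, so comparisons like the one above are more informative per instance: two frameworks with similar totals may solve largely different problems. A minimal sketch of such a per-instance diff, using hypothetical instance IDs and made-up results:

```python
def compare_scaffolds(results_a, results_b):
    """Compare two scaffolds' per-instance outcomes on the same problem set.

    results_a, results_b: dicts mapping instance ID -> bool (resolved or not).
    Returns the sets of instances each scaffold solved that the other missed.
    """
    only_a = {i for i, ok in results_a.items() if ok and not results_b.get(i, False)}
    only_b = {i for i, ok in results_b.items() if ok and not results_a.get(i, False)}
    return only_a, only_b


# Hypothetical results: same model, different scaffolding.
scaffold_a = {"repo__pkg-101": True,  "repo__pkg-202": False, "repo__pkg-303": True}
scaffold_b = {"repo__pkg-101": True,  "repo__pkg-202": True,  "repo__pkg-303": False}
```

Here each scaffold uniquely solves one instance despite identical totals, which is exactly the kind of gap a single headline score obscures.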
