DeepSWE: A Benchmark for Long-Horizon Coding Agents
SWE-bench has been the default coding-agent leaderboard for a while, but it has well-known weaknesses. Most tasks come from existing public issues and PR patches, so a high score might partly reflect memorization. Most tasks are also single-file bug fixes, which is not representative of the multi-file, long-horizon work that a coding agent does in practice.
DeepSWE from DataCurve AI tries a different approach. Tasks and reference solutions are written from scratch, scoring is behavior-based rather than patch-matching, and all leaderboard runs use a fixed agent harness so model comparisons stay consistent. Despite the name, DeepSWE is not a model or training recipe: per the official README and site, it is a benchmark and leaderboard.
Task composition
DeepSWE has 113 tasks across five languages: TypeScript (35), Go (34), Python (34), JavaScript (5), Rust (5). The official site lists 91 source repositories. Most tasks (106 of 113) are classified as feature requests rather than bug fixes. That is a meaningful difference from SWE-bench Verified, which skews toward bug fixes.
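As a quick sanity check on these figures, the short sketch below sums the per-language counts and works out the feature-request share. The numbers are the ones quoted above; the variable names are only illustrative.

```python
# Per-language task counts as quoted in this section.
tasks_by_language = {
    "TypeScript": 35,
    "Go": 34,
    "Python": 34,
    "JavaScript": 5,
    "Rust": 5,
}

total_tasks = sum(tasks_by_language.values())
feature_requests = 106  # tasks classified as feature requests

feature_share = feature_requests / total_tasks
remaining_tasks = total_tasks - feature_requests

print(f"total tasks: {total_tasks}")
print(f"feature-request share: {feature_share:.1%}")
print(f"tasks not classified as feature requests: {remaining_tasks}")
```

The counts add up to the stated 113, and roughly 94% of tasks are feature requests.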
Harbor task format
Each task follows the Harbor framework layout:
- task.toml: metadata (repo URL, base commit, language, image, limits)
- instruction.md: the prompt the agent receives
- pre_artifacts.sh: extracts the agent's commits as a patch
- environment/: Dockerfile for the prebuilt environment
- tests/: verifier entry point, held-out tests, grading config
- solution/: reference solution, hidden from the agent
Every task has a 5,400-second (90-minute) agent timeout. Internet access is blocked (allow_internet = false).
One example task gives a sense of the difficulty level. happy-dom-abort-pending-body-reads targets the TypeScript repo capricorn86/happy-dom. The instruction asks for correct abort/cleanup semantics when a shutdown interrupts Request/Response body consumption, formData parsing, and timers. This is not a function to implement in isolation; it requires understanding how multiple components interact and changing behavior across several files.
How scoring works
Behavior-based verification
This is the sharpest departure from SWE-bench. DeepSWE’s verifier does not compare the submitted patch against a reference solution. It runs tests in a separate container to check whether the agent’s changes produce the behavior described in the instruction. A solution with a different internal structure still passes if the observable behavior is correct.
The solution/ reference answer is never shown to the agent and is not used during grading. It exists for offline correctness spot-checks by reviewers.
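A toy illustration of the idea, not DeepSWE's actual verifier: two implementations with different internal structure are judged only on their observable output. The functions and the test cases are invented for this sketch.

```python
from collections import Counter


def word_counts_reference(text: str) -> dict[str, int]:
    """Stand-in for a reference solution: explicit loop."""
    counts: dict[str, int] = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def word_counts_submission(text: str) -> dict[str, int]:
    """Stand-in for an agent's solution: different structure, same behavior."""
    return dict(Counter(text.lower().split()))


def behavior_check(implementation) -> bool:
    """Behavior-based test: asserts on outputs only, never on source text."""
    return (
        implementation("a b A") == {"a": 2, "b": 1}
        and implementation("") == {}
    )


reference_passes = behavior_check(word_counts_reference)
submission_passes = behavior_check(word_counts_submission)
print(reference_passes, submission_passes)
```

Both implementations pass because the check never inspects how the result was computed. A patch-matching grader would reject the second one for differing from the reference.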
Separate verifier environment
Since v1.1, scoring uses Harbor’s separate verifier environment:
- The agent modifies code in an isolated container and commits.
- pre_artifacts.sh extracts those commits as a patch.
- The patch is applied to a fresh container.
- The verifier runs tests and grades from that clean state.
- Results go into reward.json, ctrf.json, run.log, and related files.
This separation prevents the agent from polluting the verification environment and makes every run reproducible from the patch alone.
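The extract-and-reapply flow can be sketched with plain git commands. This is an assumption-laden miniature, not Harbor's or Pier's implementation: temporary directories stand in for containers, a one-file repository stands in for a task, and all helper names are hypothetical. It assumes `git` is installed and on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path

# Inline identity so commits work without any global git configuration.
GIT_OPTS = [
    "-c", "user.name=sketch",
    "-c", "user.email=sketch@example.invalid",
    "-c", "commit.gpgsign=false",
]


def git(cwd: Path, *args: str) -> str:
    """Run a git command in cwd and return its stdout."""
    result = subprocess.run(
        ["git", *GIT_OPTS, *args],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return result.stdout


def commit_all(repo: Path, message: str) -> str:
    """Stage everything, commit, and return the new commit hash."""
    git(repo, "add", "-A")
    git(repo, "commit", "-q", "-m", message)
    return git(repo, "rev-parse", "HEAD").strip()


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        agent_repo = root / "agent"
        agent_repo.mkdir()
        git(agent_repo, "init", "-q")
        (agent_repo / "greet.py").write_text("def greet():\n    return 'hi'\n")
        base = commit_all(agent_repo, "base")

        # Step 1: the "agent" edits code and commits.
        (agent_repo / "greet.py").write_text("def greet():\n    return 'hello'\n")
        commit_all(agent_repo, "agent change")

        # Step 2: extract everything committed after the base as a patch.
        patch_file = root / "agent.patch"
        patch_file.write_text(git(agent_repo, "diff", base, "HEAD"))

        # Step 3: apply the patch to a fresh checkout of the base commit.
        fresh = root / "fresh"
        git(root, "clone", "-q", str(agent_repo), str(fresh))
        git(fresh, "checkout", "-q", base)
        git(fresh, "apply", str(patch_file))

        # Step 4: verify behavior from the clean state.
        namespace: dict = {}
        exec((fresh / "greet.py").read_text(), namespace)
        return namespace["greet"]()


result = demo()
print(result)
```

Because the fresh checkout only ever sees the patch, anything else the agent left behind in its own working directory cannot influence the verification step.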
Pier and the network allowlist
DataCurve’s Pier is a Harbor-compatible runner that forks Harbor and adds per-agent network allowlists. The practical problem it solves: with allow_internet = false, a fully offline container also blocks LLM API calls and dependency installs that the agent scaffold needs. Pier isolates the task environment while allowing only the network traffic the agent requires.
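Conceptually, a per-agent allowlist reduces to a default-deny membership check on the destination host. The sketch below shows only that idea; it is not Pier's implementation or configuration format, and the allowlist contents are a hypothetical example for an agent that talks to a single LLM API.

```python
def host_allowed(host: str, allowlist: set[str]) -> bool:
    """Default-deny: only exact, case-insensitive host matches pass."""
    normalized = host.lower().rstrip(".")
    return normalized in allowlist


# Hypothetical allowlist for an agent that only needs one LLM API endpoint.
AGENT_ALLOWLIST = {"api.anthropic.com"}

for candidate in ["api.anthropic.com", "pypi.org", "github.com"]:
    verdict = "allow" if host_allowed(candidate, AGENT_ALLOWLIST) else "block"
    print(f"{candidate}: {verdict}")
```

Everything not explicitly listed is blocked, which keeps the task environment offline while the scaffold can still reach its model provider.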
mini-swe-agent as the fixed harness
Every leaderboard run uses mini-swe-agent as the agent scaffold, with only the model swapped. The leaderboard compares models under the same scaffold. It is not a head-to-head comparison of Claude Code versus Codex as finished products. That distinction matters when reading the numbers.
Leaderboard snapshot (June 2026)
From the official site; all runs on mini-swe-agent:
| Model | Score |
|---|---|
| gpt-5.5 [xhigh] | 70% ± 3% |
| claude-opus-4.8 [max] | 58% ± 2% |
| gpt-5.4 [xhigh] | 56% ± 2% |
| claude-opus-4.7 [max] | 54% ± 5% |
| claude-sonnet-4.6 [high] | 32% ± 2% |
| gemini-3.5-flash [medium] | 28% ± 4% |
| deepseek-v4-pro | 8% ± 3% |
The reasoning budget tier in brackets (xhigh, max, high, medium) affects scores, so each row describes a model at a specific budget. The same model at a lower budget would likely score lower.
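The ± values matter when comparing neighboring rows. The sketch below treats each entry as a simple interval (score minus and plus the stated margin) and lists adjacent models whose intervals overlap. What the margin represents statistically is not stated in this article, so this is a rough reading aid rather than a significance test.

```python
# (model, score in percent, stated margin in percent) from the table above.
leaderboard = [
    ("gpt-5.5 [xhigh]", 70, 3),
    ("claude-opus-4.8 [max]", 58, 2),
    ("gpt-5.4 [xhigh]", 56, 2),
    ("claude-opus-4.7 [max]", 54, 5),
    ("claude-sonnet-4.6 [high]", 32, 2),
    ("gemini-3.5-flash [medium]", 28, 4),
    ("deepseek-v4-pro", 8, 3),
]


def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True if [score - margin, score + margin] of a and b intersect."""
    _, score_a, margin_a = a
    _, score_b, margin_b = b
    return (
        score_a - margin_a <= score_b + margin_b
        and score_b - margin_b <= score_a + margin_a
    )


overlapping_neighbors = [
    (upper[0], lower[0])
    for upper, lower in zip(leaderboard, leaderboard[1:])
    if intervals_overlap(upper, lower)
]

for upper_name, lower_name in overlapping_neighbors:
    print(f"{upper_name} and {lower_name} overlap")
```

Under this reading, the second through fourth rows are not clearly separated from their neighbors, while the gap between the first and second rows exceeds the stated margins.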
Comparison with SWE-bench
From the official blog:
| Metric | SWE-Bench Verified | SWE-Bench Pro | DeepSWE |
|---|---|---|---|
| Avg. prompt length | 1,700 chars | 4,614 chars | 2,158 chars |
| Avg. lines added (reference solution) | 10 | 120 | 668 |
| Avg. files edited | 1 | 5 | 7 |
DeepSWE's prompts are less than half the length of SWE-Bench Pro's, but the expected change is more than five times larger by lines added. DeepSWE is designed to test whether an agent can navigate a codebase and produce a substantial change from a short behavioral description, rather than implement a long specification directly.
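One way to put a number on that contrast is the count of reference-solution lines per 1,000 prompt characters. This ratio is an ad hoc illustration computed from the table above, not a metric reported by DataCurve.

```python
# Averages from the comparison table above.
benchmarks = {
    "SWE-Bench Verified": {"prompt_chars": 1700, "lines_added": 10},
    "SWE-Bench Pro": {"prompt_chars": 4614, "lines_added": 120},
    "DeepSWE": {"prompt_chars": 2158, "lines_added": 668},
}

# Reference-solution lines added per 1,000 characters of prompt.
lines_per_1k_chars = {
    name: round(1000 * row["lines_added"] / row["prompt_chars"], 1)
    for name, row in benchmarks.items()
}

for name, ratio in lines_per_1k_chars.items():
    print(f"{name}: {ratio} lines per 1,000 prompt chars")
```

By this measure DeepSWE asks for roughly twelve times more code per unit of prompt than SWE-Bench Pro, which is consistent with the short-description, large-change design described above.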
Contamination claims and their limits
The official blog says tasks were written from scratch, not derived from existing public issue/PR/commit data. Since the reference solution is not used for grading, even if a model had seen it during pretraining, passing the verifier still requires producing the correct behavior.
That said, “contamination-free” is a design claim from DataCurve, not independently verified by a third party. Once tasks are public, they become potential training data for future models. These claims are best understood relative to a publication date.
Running it yourself
Before running, configure your provider’s API key as an environment variable according to the provider’s official documentation (e.g., Anthropic, OpenAI).
```shell
git clone https://github.com/datacurve-ai/deep-swe
uv tool install datacurve-pier

# Full run with Claude Opus 4.8
pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-8

# Subset run (10 tasks, fixed seed)
pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0

# Single task
pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent

# Parallel run via Modal
pier run -p deep-swe/tasks --agent mini-swe-agent --model <provider/model> --env modal
```
Further reading
- SWE-bench: the established coding-agent benchmark DeepSWE compares against
- SWE-agent/mini-swe-agent (GitHub): the model-agnostic harness used for all DeepSWE leaderboard runs
- Harbor Framework: the task format DeepSWE is built on
- LiveCodeBench: a coding benchmark that continuously collects new problems to limit contamination
References
- DeepSWE official site: DataCurve AI, accessed 2026-06-30
- datacurve-ai/deep-swe (GitHub): README and task structure specification
- DeepSWE blog: official blog, SWE-bench comparison figures
- DeepSWE Run page: Pier installation, network allowlist details
- Harbor Framework Docs: task format specification, accessed 2026-06-30
- SWE-agent/mini-swe-agent (GitHub): agent harness, accessed 2026-06-30