DeepSWE: A Benchmark for Long-Horizon Coding Agents

whackur
AI
June 30, 2026

SWE-bench has been the default coding-agent leaderboard for a while, but it has well-known weaknesses. Most tasks come from existing public issues and PR patches, so a high score might partly reflect memorization. Most tasks are also single-file bug fixes, which is not representative of the multi-file, long-horizon work that a coding agent does in practice.

DeepSWE from DataCurve AI tries a different approach. Tasks and reference solutions are written from scratch, scoring is behavior-based rather than patch-matching, and all leaderboard runs use a fixed agent harness so model comparisons stay consistent. Despite the name, DeepSWE is not a model or training recipe: per the official README and site, it is a benchmark and leaderboard.

Task composition

DeepSWE has 113 tasks across five languages: TypeScript (35), Go (34), Python (34), JavaScript (5), Rust (5). The official site lists 91 source repositories. Most tasks (106 of 113) are classified as feature requests rather than bug fixes. That is a meaningful difference from SWE-bench Verified, which skews toward bug fixes.
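
The per-language counts can be checked against the stated total with a few lines of Python. This is only an arithmetic sketch over the figures quoted above; the variable names are illustrative.

```python
# Task counts per language, as quoted above.
tasks_by_language = {
    "TypeScript": 35,
    "Go": 34,
    "Python": 34,
    "JavaScript": 5,
    "Rust": 5,
}

total = sum(tasks_by_language.values())
feature_requests = 106  # tasks classified as feature requests

# Share of each language, and of feature requests, as percentages.
language_share = {
    lang: round(100 * count / total, 1)
    for lang, count in tasks_by_language.items()
}
feature_share = round(100 * feature_requests / total, 1)

print(total)          # 113
print(feature_share)  # 93.8
```

Roughly 94% of tasks are feature requests, and the three large languages each account for about 30% of the set.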

Harbor task format

Each task follows the Harbor framework layout:

task.toml         Metadata: repo URL, base commit, language, image, limits
instruction.md    The prompt the agent receives
pre_artifacts.sh  Extracts the agent's commits as a patch
environment/      Dockerfile for the prebuilt environment
tests/            Verifier entry point, held-out tests, grading config
solution/         Reference solution, hidden from the agent

Every task has a 5,400-second (90-minute) agent timeout. Internet access is blocked (allow_internet = false).

One example task gives a sense of the difficulty level. happy-dom-abort-pending-body-reads targets the TypeScript repo capricorn86/happy-dom. The instruction asks for correct abort/cleanup semantics when a shutdown interrupts Request/Response body consumption, formData parsing, and timers. This is not a function to implement in isolation; it requires understanding how multiple components interact and changing behavior across several files.

How scoring works

Behavior-based verification

This is the sharpest departure from SWE-bench. DeepSWE’s verifier does not compare the submitted patch against a reference solution. It runs tests in a separate container to check whether the agent’s changes produce the behavior described in the instruction. A solution with a different internal structure still passes if the observable behavior is correct.

The solution/ reference answer is never shown to the agent and is not used during grading. It exists for offline correctness spot-checks by reviewers.
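
A toy illustration of the idea, not DeepSWE's verifier: two implementations with different internal structure both pass the same behavioral check, even though a patch-matching grader would treat them as different. The function names and test cases are invented for this sketch.

```python
def reference_dedupe(items):
    """Stand-in for a reference solution: explicit loop with a seen-set."""
    seen, out = set(), []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

def agent_dedupe(items):
    """Stand-in for an agent's solution: different structure, same behavior."""
    return list(dict.fromkeys(items))

def behavior_check(fn) -> bool:
    """Held-out behavioral test: only inputs and outputs are inspected."""
    cases = [
        ([], []),
        ([1, 1, 2], [1, 2]),
        (["b", "a", "b"], ["b", "a"]),
    ]
    return all(fn(list(given)) == expected for given, expected in cases)

# Crude proxy for "the code is not the same": compare compiled bytecode.
implementations_differ = (
    reference_dedupe.__code__.co_code != agent_dedupe.__code__.co_code
)

print(behavior_check(reference_dedupe), behavior_check(agent_dedupe))
print(implementations_differ)
```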

Separate verifier environment

Since v1.1, scoring uses Harbor’s separate verifier environment:

1. The agent modifies code in an isolated container and commits.
2. pre_artifacts.sh extracts those commits as a patch.
3. The patch is applied to a fresh container.
4. The verifier runs tests and grades from that clean state.
5. Results go into reward.json, ctrf.json, run.log, and related files.

This separation prevents the agent from polluting the verification environment and makes every run reproducible from the patch alone.
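
The steps above can be modeled in a few lines. This is a toy simulation under stated assumptions, not how Harbor or pre_artifacts.sh is implemented: a "repo" is a dict of path to file contents, a "patch" is the set of committed files that changed, and the shape of the reward result is invented for the example.

```python
import copy

# Toy base repo with a bug the agent is expected to fix.
BASE_REPO = {"src/math.py": "def add(a, b):\n    return a - b\n"}

def agent_run(repo: dict) -> dict:
    """Agent edits code and also leaves stray state in its own container."""
    repo["src/math.py"] = "def add(a, b):\n    return a + b\n"
    repo["/tmp/cache.bin"] = "junk the agent left behind"  # never committed
    return repo

def extract_patch(base: dict, agent_repo: dict, committed: set) -> dict:
    """Keep only committed paths whose contents differ from the base."""
    return {
        path: agent_repo[path]
        for path in committed
        if agent_repo.get(path) != base.get(path)
    }

def apply_patch(base: dict, patch: dict) -> dict:
    """Apply the patch to a fresh copy of the base repo."""
    fresh = copy.deepcopy(base)
    fresh.update(patch)
    return fresh

def verify(repo: dict) -> dict:
    """Run a held-out check against the clean state and return a reward."""
    namespace = {}
    exec(repo["src/math.py"], namespace)
    passed = namespace["add"](2, 3) == 5
    return {"reward": 1.0 if passed else 0.0}

agent_repo = agent_run(copy.deepcopy(BASE_REPO))
patch = extract_patch(BASE_REPO, agent_repo, committed={"src/math.py"})
clean_repo = apply_patch(BASE_REPO, patch)
result = verify(clean_repo)
print(result)
```

The stray file never reaches the verifier because only the patch crosses the boundary, which is the property the separate environment is meant to guarantee.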

Pier and the network allowlist

DataCurve’s Pier is a Harbor-compatible runner, forked from Harbor, that adds per-agent network allowlists. The practical problem it solves: with allow_internet = false the task container is fully offline, which also blocks the LLM API calls and dependency installs that the agent scaffold needs. Pier keeps the task environment isolated while permitting only the network traffic the agent requires.
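
Conceptually, an allowlist is a default-deny rule with a short list of exceptions. The sketch below shows that logic only; it is not Pier's implementation or configuration format, and the hostnames are example values chosen for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist for one agent. These hosts are examples, not
# values taken from Pier.
ALLOWED_HOSTS = {"api.anthropic.com", "pypi.org"}

def is_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Default deny: permit a request only if its host is on the list."""
    host = urlsplit(url).hostname  # lowercased by urlsplit, or None
    return host is not None and host in allowed_hosts

print(is_allowed("https://api.anthropic.com/v1/messages"))
print(is_allowed("https://github.com/datacurve-ai/deep-swe"))
```

An exact-match rule like this one is deliberately strict: subdomains and lookalike hosts are denied unless listed explicitly.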

mini-swe-agent as the fixed harness

Every leaderboard run uses mini-swe-agent as the agent scaffold, with only the model swapped. The leaderboard compares models under the same scaffold. It is not a head-to-head comparison of Claude Code versus Codex as finished products. That distinction matters when reading the numbers.

Leaderboard snapshot (June 2026)

From the official site; all runs on mini-swe-agent:

Model	Score
gpt-5.5 [xhigh]	70% ± 3%
claude-opus-4.8 [max]	58% ± 2%
gpt-5.4 [xhigh]	56% ± 2%
claude-opus-4.7 [max]	54% ± 5%
claude-sonnet-4.6 [high]	32% ± 2%
gemini-3.5-flash [medium]	28% ± 4%
deepseek-v4-pro	8% ± 3%

The reasoning-budget tier in brackets (xhigh, max, high, medium) affects scores, so each row describes a model at one specific setting. The same model at a lower budget would likely score lower.
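
One way to read the ± values is to check which neighboring rows are actually separated. The sketch below assumes each ± is a symmetric interval around the score; the source does not say how the intervals were computed, so treat the result as a reading aid rather than a significance test.

```python
# Scores from the table above as (score, plus_minus) in percentage points.
scores = {
    "gpt-5.5 [xhigh]": (70, 3),
    "claude-opus-4.8 [max]": (58, 2),
    "gpt-5.4 [xhigh]": (56, 2),
    "claude-opus-4.7 [max]": (54, 5),
    "claude-sonnet-4.6 [high]": (32, 2),
    "gemini-3.5-flash [medium]": (28, 4),
    "deepseek-v4-pro": (8, 3),
}

def intervals_overlap(a, b) -> bool:
    """True if [score - pm, score + pm] for a and b share any point."""
    (score_a, pm_a), (score_b, pm_b) = a, b
    return abs(score_a - score_b) <= pm_a + pm_b

# Adjacent leaderboard pairs whose intervals overlap.
names = list(scores)
overlapping = [
    (upper, lower)
    for upper, lower in zip(names, names[1:])
    if intervals_overlap(scores[upper], scores[lower])
]

for upper, lower in overlapping:
    print(f"{upper} vs {lower}: intervals overlap")
```

Under that assumption, three adjacent pairs overlap (opus-4.8 and gpt-5.4, gpt-5.4 and opus-4.7, sonnet-4.6 and gemini-3.5-flash), while gpt-5.5 is clearly separated from the rest.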

Comparison with SWE-bench

From the official blog:

Metric	SWE-Bench Verified	SWE-Bench Pro	DeepSWE
Avg. prompt length	1,700 chars	4,614 chars	2,158 chars
Avg. lines added (reference solution)	10	120	668
Avg. files edited	1	5	7

DeepSWE prompts are less than half the length of SWE-Bench Pro prompts (2,158 vs. 4,614 characters on average), but the reference change is more than five times larger (668 vs. 120 lines added). DeepSWE is designed to test whether an agent can navigate a codebase and produce a substantial change from a short behavioral description, rather than implement a long specification directly.
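
The gap is easier to see as a ratio. This sketch divides the average reference lines added by the average prompt length for each benchmark, using only the figures in the table above; the "lines per 1,000 prompt characters" metric is an illustration of my own, not one the blog reports.

```python
# (avg prompt length in chars, avg lines added, avg files edited)
benchmarks = {
    "SWE-Bench Verified": (1700, 10, 1),
    "SWE-Bench Pro": (4614, 120, 5),
    "DeepSWE": (2158, 668, 7),
}

# Reference lines added per 1,000 characters of prompt.
lines_per_kchar = {
    name: round(1000 * lines / chars, 1)
    for name, (chars, lines, _files) in benchmarks.items()
}

for name, ratio in lines_per_kchar.items():
    print(f"{name}: {ratio} lines per 1,000 prompt chars")
```

By this rough measure DeepSWE asks for about 310 lines of change per 1,000 prompt characters, against about 26 for SWE-Bench Pro and about 6 for SWE-Bench Verified.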

Contamination claims and their limits

The official blog says tasks were written from scratch, not derived from existing public issue/PR/commit data. Since the reference solution is not used for grading, even if a model had seen it during pretraining, passing the verifier still requires producing the correct behavior.

That said, “contamination-free” is a design claim from DataCurve, not independently verified by a third party. Once tasks are public, they become potential training data for future models. These claims are best understood relative to a publication date.

Running it yourself

From the Run page and README:

Before running, configure your provider’s API key as an environment variable according to the provider’s official documentation (e.g., Anthropic, OpenAI).

git clone https://github.com/datacurve-ai/deep-swe
uv tool install datacurve-pier

# Full run with Claude Opus 4.8
pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-8

# Subset run (10 tasks, fixed seed)
pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0

# Single task
pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent

# Parallel run via Modal
pier run -p deep-swe/tasks --agent mini-swe-agent --model <provider/model> --env modal
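
For scripted runs, the commands above can be assembled programmatically. The helper below is a hypothetical wrapper that uses only the flags shown above and does not execute anything. The environment variable name ANTHROPIC_API_KEY is an assumption based on common provider convention; check your provider's documentation for the name it expects.

```python
import os
import shlex

def build_pier_command(tasks_path, model=None, n_tasks=None, sample_seed=None):
    """Assemble a `pier run` argv using only the flags shown above."""
    argv = ["pier", "run", "-p", tasks_path, "--agent", "mini-swe-agent"]
    if model is not None:
        argv += ["--model", model]
    if n_tasks is not None:
        argv += ["--n-tasks", str(n_tasks)]
    if sample_seed is not None:
        argv += ["--sample-seed", str(sample_seed)]
    return argv

def missing_env(required=("ANTHROPIC_API_KEY",)):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

subset = build_pier_command("deep-swe/tasks", n_tasks=10, sample_seed=0)
print(shlex.join(subset))
```

Passing the resulting list to subprocess.run, after confirming missing_env() returns an empty list, would start the run; that step is left out here so the sketch has no side effects.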

References

DeepSWE official site: DataCurve AI, accessed 2026-06-30
datacurve-ai/deep-swe (GitHub): README and task structure specification
DeepSWE blog: official blog, SWE-bench comparison figures
DeepSWE Run page: Pier installation, network allowlist details
Harbor Framework Docs: task format specification, accessed 2026-06-30
SWE-agent/mini-swe-agent (GitHub): agent harness, accessed 2026-06-30
