DeepSWE: A Benchmark for Long-Horizon Coding Agents

SWE-bench has long been the default leaderboard for coding agents, but it has two well-known weaknesses. Most tasks come from existing public issues and PR patches, so a high score may partly reflect memorization. Most tasks are also single-file bug fixes, which are not representative of the multi-file, long-horizon work that coding agents do in practice.

DeepSWE from DataCurve AI tries a different approach. Tasks and reference solutions are written from scratch, scoring is behavior-based rather than patch-matching, and all leaderboard runs use a fixed agent harness so model comparisons stay consistent. Despite the name, DeepSWE is not a model or training recipe: per the official README and site, it is a benchmark and leaderboard.

Task composition

DeepSWE has 113 tasks across five languages: TypeScript (35), Go (34), Python (34), JavaScript (5), Rust (5). The official site lists 91 source repositories. Most tasks (106 of 113) are classified as feature requests rather than bug fixes. That is a meaningful difference from SWE-bench Verified, which skews toward bug fixes.
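
The composition figures are easy to sanity-check. The short Python sketch below uses only the counts quoted above; the variable names are illustrative.

```python
# Task counts per language, as quoted in the text above.
tasks_by_language = {
    "TypeScript": 35,
    "Go": 34,
    "Python": 34,
    "JavaScript": 5,
    "Rust": 5,
}

total = sum(tasks_by_language.values())
feature_requests = 106

# Share of tasks classified as feature requests.
feature_share = feature_requests / total

# How much a single task moves a score, in percentage points.
points_per_task = 100 / total

print(f"total tasks: {total}")
print(f"feature-request share: {feature_share:.1%}")
print(f"one task is worth {points_per_task:.2f} percentage points")
```

With 113 tasks, a one-task difference is a little under one percentage point, which is worth keeping in mind when comparing nearby leaderboard entries.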

Harbor task format

Each task follows the Harbor framework layout:

task.toml         Metadata: repo URL, base commit, language, image, limits
instruction.md    The prompt the agent receives
pre_artifacts.sh  Extracts the agent's commits as a patch
environment/      Dockerfile for the prebuilt environment
tests/            Verifier entry point, held-out tests, grading config
solution/         Reference solution, hidden from the agent

Every task has a 5,400-second (90-minute) agent timeout. Internet access is blocked (allow_internet = false).

One example task gives a sense of the difficulty. The task happy-dom-abort-pending-body-reads targets the TypeScript repo capricorn86/happy-dom. The instruction asks for correct abort/cleanup semantics when a shutdown interrupts Request/Response body consumption, formData parsing, and timers. This is not a function to implement in isolation; it requires understanding how multiple components interact and changing behavior across several files.

How scoring works

Behavior-based verification

This is the sharpest departure from SWE-bench. DeepSWE’s verifier does not compare the submitted patch against a reference solution. It runs tests in a separate container to check whether the agent’s changes produce the behavior described in the instruction. A solution with a different internal structure still passes if the observable behavior is correct.

The solution/ reference answer is never shown to the agent and is not used during grading. It exists for offline correctness spot-checks by reviewers.
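
The difference between matching a reference and checking behavior can be shown with a toy Python example. Nothing here is taken from the DeepSWE verifier: the two implementations, the function names, and the held-out cases are all hypothetical, and the bytecode comparison merely stands in for "the submitted change looks different from the reference".

```python
def reference_dedupe(items):
    """Hypothetical reference solution: order-preserving de-duplication."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

def agent_dedupe(items):
    """Hypothetical agent solution: same behavior, different internals."""
    return list(dict.fromkeys(items))

def behavior_check(fn) -> bool:
    """Held-out checks only look at observable input/output behavior."""
    cases = [
        ([], []),
        ([1, 1, 2], [1, 2]),
        (["b", "a", "b"], ["b", "a"]),
    ]
    return all(fn(list(given)) == expected for given, expected in cases)

# A structural comparison would reject the agent's solution...
same_bytecode = agent_dedupe.__code__.co_code == reference_dedupe.__code__.co_code
# ...while a behavioral check accepts it.
print("bytecode matches reference:", same_bytecode)
print("agent passes behavior check:", behavior_check(agent_dedupe))
```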

Separate verifier environment

Since v1.1, scoring uses Harbor’s separate verifier environment:

  1. The agent modifies code in an isolated container and commits.
  2. pre_artifacts.sh extracts those commits as a patch.
  3. The patch is applied to a fresh container.
  4. The verifier runs tests and grades from that clean state.
  5. Results go into reward.json, ctrf.json, run.log, and related files.

This separation prevents the agent from polluting the verification environment and makes every run reproducible from the patch alone.
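
The five steps above can be sketched as a toy Python pipeline. This is a simplified model, not Harbor's implementation: containers are stood in for by plain dictionaries mapping file paths to contents, the "patch" is just the set of committed files that changed (deletions are ignored), and the shape of the result dictionary is an assumption rather than the real reward.json schema.

```python
import copy

# Base commit state, shared by the agent and verifier environments.
BASE_REPO = {"src/calc.py": "def add(a, b):\n    return a - b\n"}

def run_agent(repo):
    """Step 1: the agent works in its own copy and commits some changes."""
    committed = copy.deepcopy(repo)
    committed["src/calc.py"] = "def add(a, b):\n    return a + b\n"
    # Uncommitted side effects (caches, edited tooling) stay behind.
    uncommitted = {"/tmp/fake_test_runner.py": "print('all passed')\n"}
    return committed, uncommitted

def extract_patch(base, committed):
    """Step 2: the patch is the set of committed files that differ from base."""
    return {path: text for path, text in committed.items() if base.get(path) != text}

def apply_patch(base, patch):
    """Step 3: apply the patch to a fresh copy of the base state."""
    fresh = copy.deepcopy(base)
    fresh.update(patch)
    return fresh

def verify(repo):
    """Step 4: run a held-out behavioral check against the clean state."""
    namespace = {}
    exec(repo["src/calc.py"], namespace)
    return namespace["add"](2, 3) == 5

def grade(base):
    committed, _uncommitted = run_agent(base)
    patch = extract_patch(base, committed)
    clean_repo = apply_patch(base, patch)
    # Step 5: record the outcome (hypothetical result shape).
    return {
        "reward": 1.0 if verify(clean_repo) else 0.0,
        "files_in_clean_repo": sorted(clean_repo),
    }

result = grade(BASE_REPO)
print(result)
```

The point of the design shows up in the last step: the verifier only ever sees the base state plus the patch, so anything the agent left outside its commits never reaches grading.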

Pier and the network allowlist

DataCurve’s Pier is a Harbor-compatible runner, forked from Harbor, that adds per-agent network allowlists. It solves a practical problem: with allow_internet = false, a fully offline container would also block the LLM API calls and dependency installs that the agent scaffold needs. Pier keeps the task environment isolated while allowing only the network traffic the agent requires.
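
The idea of a per-agent allowlist can be illustrated with a small Python sketch. It does not reflect how Pier is implemented or configured; the dictionary layout, the choice of hosts, and the function name are all assumptions chosen for illustration.

```python
# Hypothetical per-agent allowlist: only hosts the scaffold itself needs.
ALLOWLISTS = {
    "mini-swe-agent": {"api.anthropic.com", "api.openai.com"},
}

def is_allowed(agent: str, host: str) -> bool:
    """Return True only if `host` is explicitly allowlisted for `agent`."""
    allowed_hosts = ALLOWLISTS.get(agent, set())
    # Normalize case and a trailing dot before the exact-match lookup.
    return host.lower().rstrip(".") in allowed_hosts

# The LLM API call goes through; other outbound traffic does not.
print(is_allowed("mini-swe-agent", "api.anthropic.com"))
print(is_allowed("mini-swe-agent", "github.com"))
print(is_allowed("unknown-agent", "api.anthropic.com"))
```

The sketch is default-deny: an agent with no entry, or a host not on the list, gets no network access at all.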

mini-swe-agent as the fixed harness

Every leaderboard run uses mini-swe-agent as the agent scaffold, with only the model swapped. The leaderboard compares models under the same scaffold. It is not a head-to-head comparison of Claude Code versus Codex as finished products. That distinction matters when reading the numbers.

Leaderboard snapshot (June 2026)

From the official site; all runs on mini-swe-agent:

Model                       Score
gpt-5.5 [xhigh]             70% ± 3%
claude-opus-4.8 [max]       58% ± 2%
gpt-5.4 [xhigh]             56% ± 2%
claude-opus-4.7 [max]       54% ± 5%
claude-sonnet-4.6 [high]    32% ± 2%
gemini-3.5-flash [medium]   28% ± 4%
deepseek-v4-pro             8% ± 3%

The bracketed tier (xhigh, max, high, medium) is the reasoning budget used for that run, and it affects the score: the same model at a lower budget would be expected to score lower. deepseek-v4-pro is listed without a tier.
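
One way to read the ± figures is to check whether the ranges of neighboring entries overlap. The Python sketch below uses the scores from the snapshot above and treats each ± value as a symmetric interval around the score, which is an assumption: the snapshot does not say how the margins are computed.

```python
# (score, margin) in percentage points, copied from the leaderboard snapshot.
SCORES = {
    "gpt-5.5 [xhigh]": (70, 3),
    "claude-opus-4.8 [max]": (58, 2),
    "gpt-5.4 [xhigh]": (56, 2),
    "claude-opus-4.7 [max]": (54, 5),
    "claude-sonnet-4.6 [high]": (32, 2),
    "gemini-3.5-flash [medium]": (28, 4),
    "deepseek-v4-pro": (8, 3),
}

def interval(model):
    """Return (low, high) for a model, assuming a symmetric margin."""
    score, margin = SCORES[model]
    return score - margin, score + margin

def overlaps(model_a, model_b):
    """True if the two models' intervals share at least one point."""
    low_a, high_a = interval(model_a)
    low_b, high_b = interval(model_b)
    return low_a <= high_b and low_b <= high_a

ranked = list(SCORES)  # already ordered from highest to lowest score
for upper, lower in zip(ranked, ranked[1:]):
    verdict = "overlap" if overlaps(upper, lower) else "clear gap"
    print(f"{upper} vs {lower}: {verdict}")
```

Under that assumption, gpt-5.5 [xhigh] is clearly ahead, while the second through fourth entries are not cleanly separated from one another.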

Comparison with SWE-bench

From the official blog:

Metric                                  SWE-Bench Verified   SWE-Bench Pro   DeepSWE
Avg. prompt length                      1,700 chars          4,614 chars     2,158 chars
Avg. lines added (reference solution)   10                   120             668
Avg. files edited                       1                    5               7

The prompts are shorter than SWE-Bench Pro's, but the expected change is far larger. DeepSWE is designed to test whether an agent can navigate a codebase and produce a substantial change from a short behavioral description, rather than implement a long specification directly.
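
The gap between prompt size and change size can be put in rough numbers. The Python sketch below takes the averages in the table above as 10, 120, and 668 lines added and 1, 5, and 7 files edited; the two derived ratios are illustrative measures introduced here, not metrics reported by the blog.

```python
# Averages from the comparison table: (prompt chars, lines added, files edited).
BENCHMARKS = {
    "SWE-Bench Verified": (1700, 10, 1),
    "SWE-Bench Pro": (4614, 120, 5),
    "DeepSWE": (2158, 668, 7),
}

def lines_per_100_prompt_chars(name):
    """Reference-solution lines added per 100 characters of prompt."""
    prompt_chars, lines_added, _files = BENCHMARKS[name]
    return 100 * lines_added / prompt_chars

def lines_per_file(name):
    """Average lines added per edited file."""
    _chars, lines_added, files_edited = BENCHMARKS[name]
    return lines_added / files_edited

for name in BENCHMARKS:
    print(
        f"{name}: {lines_per_100_prompt_chars(name):.1f} lines per 100 prompt chars, "
        f"{lines_per_file(name):.0f} lines per file"
    )
```

On these figures, DeepSWE expects roughly 31 added lines per 100 characters of prompt, against about 2.6 for SWE-Bench Pro and 0.6 for SWE-Bench Verified.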

Contamination claims and their limits

The official blog says tasks were written from scratch, not derived from existing public issue/PR/commit data. Because grading is behavior-based and does not use the reference solution, a model that had seen that solution during pretraining would still have to produce the correct behavior to pass the verifier.

That said, “contamination-free” is a design claim from DataCurve, not independently verified by a third party. Once tasks are public, they become potential training data for future models. These claims are best understood relative to a publication date.

Running it yourself

Before running, configure your provider’s API key as an environment variable, following the provider’s official documentation (e.g., Anthropic, OpenAI).

The commands below are from the Run page and README:

git clone https://github.com/datacurve-ai/deep-swe
uv tool install datacurve-pier

# Full run with Claude Opus 4.8
pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-8

# Subset run (10 tasks, fixed seed)
pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0

# Single task
pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent

# Parallel run via Modal
pier run -p deep-swe/tasks --agent mini-swe-agent --model <provider/model> --env modal
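
For scripted runs, the same invocations can be assembled programmatically. The Python sketch below is a hypothetical helper, not part of Pier: it only builds an argument list from the flags shown in the commands above and does not execute anything. The name of the API key variable depends on your provider, so it is passed in rather than assumed.

```python
import os
import shlex

def build_pier_command(tasks_path, model=None, n_tasks=None, sample_seed=None, env=None):
    """Assemble a `pier run` argv using only flags shown in the commands above."""
    argv = ["pier", "run", "-p", tasks_path, "--agent", "mini-swe-agent"]
    if model is not None:
        argv += ["--model", model]
    if n_tasks is not None:
        argv += ["--n-tasks", str(n_tasks)]
    if sample_seed is not None:
        argv += ["--sample-seed", str(sample_seed)]
    if env is not None:
        argv += ["--env", env]
    return argv

def missing_api_key(variable_name):
    """True if the given provider key variable is unset or empty."""
    return not os.environ.get(variable_name)

# Rebuild the subset run: 10 tasks with a fixed seed.
command = build_pier_command("deep-swe/tasks", n_tasks=10, sample_seed=0)
print(shlex.join(command))
```

To actually launch a run, the resulting list could be handed to subprocess.run once the key check passes.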
