SWEbench User Guide
SWE-bench is a benchmark for evaluating how well large language models solve real-world software issues collected from GitHub. Given a repository and an issue, the model is expected to generate a patch that fixes the described problem.
1. Feature Overview
ais_bench currently supports the following SWEbench capabilities:
- Datasets: full, verified, lite, multilingual
- Tasks:
  - infer: call mini-swe-agent to generate patches (model_patch)
  - eval: call the SWE-bench harness to run evaluation and count resolved instances
- Result summary: output key metrics such as accuracy, submitted_accuracy, and resolved_instances
This directory already provides 4 example configs:
- mini_swe_agent_swe_bench_lite.py
- mini_swe_agent_swe_bench_verified.py
- mini_swe_agent_swe_bench_full.py
- mini_swe_agent_swe_bench_multilingual.py
2. Prerequisites
Before running, make sure the following dependencies are available:
- Install mini-swe-agent (required for infer)
pip install mini-swe-agent
- Install the SWE-bench harness (required for eval)
git clone https://github.com/SWE-bench/SWE-bench.git
cd SWE-bench
pip install -e .
cd -
- Docker is available (both infer and eval depend on containerized environments)
docker --version
docker ps
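The three checks above can also be scripted before a long run. The following is a minimal sketch, not part of ais_bench: the helper name missing_prerequisites is hypothetical, and the import names minisweagent and swebench are assumptions about how the two packages expose themselves after installation.

```python
import importlib.util
import shutil


def missing_prerequisites(modules, commands):
    """Return the names of modules and commands that cannot be found."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in commands:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names: "minisweagent" for mini-swe-agent, "swebench" for SWE-bench.
    print(missing_prerequisites(["minisweagent", "swebench"], ["docker"]))
```

An empty list means all three were found. This only confirms that a docker executable is on PATH; it does not confirm that the daemon is running, so docker ps is still worth running by hand.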
3. Minimal Configuration (Run First, Tune Later)
It is recommended to start from mini_swe_agent_swe_bench_lite.py and only modify the three fields in models[0]:
- model: model name (required)
- url: model service endpoint (OpenAI-compatible API)
- api_key: service key (use EMPTY for local services)
Example (local vLLM setup):
models = [
    dict(
        attr="local",
        abbr="swebench",
        type="LiteLLMChat",
        model="qwen3",
        api_key="EMPTY",
        url="http://127.0.0.1:8000/v1",
        batch_size=1,
        generation_kwargs=dict(),
    )
]
Dataset Path Notes
In the example configs, path="" by default, which means online loading from Hugging Face is preferred.
- You can keep path="" to fetch data online directly
- For offline usage, change path to a local parquet file or directory (supports data/<split>-*.parquet)
First-Run Recommendations
- Start with the lite dataset
- Use batch_size=1
- Keep step_limit=200 (default in examples; do not change initially)
4. Run Commands
Run the following in the repository root (config is the config file path):
ais_bench ais_bench/configs/swe_bench_examples/mini_swe_agent_swe_bench_lite.py
The command above runs the full pipeline (all). You can also run it step by step:
# Inference only, generate predictions
ais_bench ais_bench/configs/swe_bench_examples/mini_swe_agent_swe_bench_lite.py -m infer
# Evaluate based on existing predictions
ais_bench ais_bench/configs/swe_bench_examples/mini_swe_agent_swe_bench_lite.py -m eval
Resume from Checkpoint
Use --reuse to skip completed instances, which is useful after interruptions:
ais_bench ais_bench/configs/swe_bench_examples/mini_swe_agent_swe_bench_lite.py -m infer --reuse
5. How to Read Outputs
The default output directory is outputs/default/<timestamp>/. Focus on:
- Inference outputs: predictions/swebench/swebench_*.json
  - Each instance_id contains model_patch
- Evaluation outputs: results/swebench/swebench_*.json
  - Key fields:
    - accuracy: resolved_instances / total_instances
    - submitted_accuracy: resolved_instances / submitted_instances
    - resolved_instances / unresolved_instances / error_instances
    - harness_exit_code: harness exit code
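To make the difference between the two ratios concrete, here is a small sketch that recomputes them from raw counts. The function name summarize is hypothetical, and returning 0.0 for an empty denominator is an assumption made for this example; the guide does not state how ais_bench handles that case or whether it reports fractions or percentages.

```python
def summarize(resolved_instances, submitted_instances, total_instances):
    """Recompute the two ratios described in this section from raw counts."""

    def ratio(numerator, denominator):
        # Assumption for this sketch: an empty denominator yields 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(resolved_instances, total_instances),
        "submitted_accuracy": ratio(resolved_instances, submitted_instances),
    }


if __name__ == "__main__":
    # 75 resolved out of 300 total, of which 250 had a patch submitted.
    print(summarize(75, 250, 300))
```

With these counts, accuracy is 0.25 and submitted_accuracy is 0.3. A large gap between the two suggests many instances never produced a submitted patch.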
6. Common Issues and Troubleshooting (SWEB Error Codes)
The following error codes come from SWEB_CODES. You can also refer to the full FAQ:
- Chinese FAQ: docs/source_zh_cn/faqs/error_codes.md
1) SWEB-DEPENDENCY-001: Missing mini-swe-agent
- Symptom: infer fails to start with dependency import errors
- Cause: mini-swe-agent is not installed
- Fix: run pip install mini-swe-agent
2) SWEB-DEPENDENCY-002: Missing SWE-bench harness
- Symptom: harness import error during eval
- Cause: SWE-bench is not installed, or not visible in the current environment
- Fix: install SWE-bench as described in "Prerequisites", and make sure you are using the same Python environment
3) SWEB-PARAM-001: Empty model configuration
- Symptom: prompt indicates model is not configured
- Cause: models[0]['model'] is empty or only whitespace
- Fix: configure model / url / api_key, and ensure model is non-empty at minimum
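The condition behind this error can be checked on a config before launching. This is a sketch of the rule as described above, not the actual ais_bench validation code; the helper name model_is_configured is hypothetical.

```python
def model_is_configured(models):
    """Return True when models[0]['model'] is a non-empty, non-whitespace string."""
    if not models:
        return False
    name = models[0].get("model")
    # Reject missing values, non-strings, empty strings, and whitespace-only strings.
    return isinstance(name, str) and bool(name.strip())


if __name__ == "__main__":
    print(model_is_configured([dict(model="qwen3")]))
    print(model_is_configured([dict(model="   ")]))
```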
4) SWEB-DATA-002 / SWEB-FILE-003: Dataset loading failure
- Symptom: online loading fails, or local parquet files cannot be found
- Cause:
  - Online mode: network or Hugging Face access issues
  - Local mode: path does not exist, or directory layout does not match split parquet rules
- Fix:
  - Switch to local parquet if online loading fails
  - Ensure local path follows: <root>/data/test-*.parquet or <root>/test-*.parquet
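A quick way to confirm that a local directory matches one of the two layouts above is to glob for the split files. This is a sketch only: the function name find_split_parquet is hypothetical, and checking <root>/data/ before <root>/ is an assumption for this example, not the documented lookup order of ais_bench.

```python
from pathlib import Path


def find_split_parquet(root, split="test"):
    """Return sorted parquet files for a split under <root>/data/ or <root>/."""
    root = Path(root)
    # Assumed order for this sketch: look in <root>/data first, then <root> itself.
    for base in (root / "data", root):
        if not base.is_dir():
            continue
        matches = sorted(base.glob(f"{split}-*.parquet"))
        if matches:
            return matches
    return []


if __name__ == "__main__":
    # Prints an empty list when the current directory holds no matching files.
    print(find_split_parquet("."))
```

An empty result means neither layout was found, which is the situation the local-mode cause describes.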
5) SWEB-FILE-001: Predictions file not found
- Symptom: -m eval reports missing predictions
- Cause: infer was not run first, or work_dir/reuse points to a different location
- Fix: run -m infer first, and ensure eval and infer use the same config/output directory
6) SWEB-RUNTIME-001 / SWEB-RUNTIME-002: Container or harness runtime failure
- Symptom: Docker image pull failure, or evaluation runtime errors
- Cause: unavailable images, network issues, or insufficient container runtime environment
- Fix:
  - Check docker ps first
  - Verify required images can be pulled
  - Retry with --reuse to avoid recomputing completed instances
7. Advanced Tips (Optional)
- For initial debugging, use lite first, then switch to verified / full after the pipeline is stable
- To reduce empty patches, prioritize improving model capability and agent prompt templates
- During evaluation, focus on empty_patch_instances and error_instances; they are often more actionable than accuracy in early iterations
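Empty patches can also be spotted directly in the inference output, before running eval. The sketch below assumes the predictions file is a JSON object keyed by instance_id whose values carry a model_patch field, as section 5 suggests; the exact file schema is not confirmed here, and the helper names are hypothetical.

```python
import json


def empty_patch_ids(predictions):
    """Return sorted instance ids whose model_patch is missing, empty, or whitespace."""
    return sorted(
        instance_id
        for instance_id, prediction in predictions.items()
        if not (prediction.get("model_patch") or "").strip()
    )


def load_predictions(path):
    """Load a predictions file (assumed layout: {instance_id: {"model_patch": ...}})."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Made-up instance ids for illustration only.
    sample = {
        "example__repo-1": {"model_patch": "diff --git a/x.py b/x.py\n"},
        "example__repo-2": {"model_patch": ""},
    }
    print(empty_patch_ids(sample))
```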