文件最后提交记录最后更新时间
【Feature】add SWE-bench example configs and bilingual user guide (#191) * util adapter * adapter swebench eval * adapter mini swe agent * add swebench configs --------- Co-authored-by: zhanggaohua@huawei.com <GaoHua>1 个月前
【Feature】add SWE-bench example configs and bilingual user guide (#191) * util adapter * adapter swebench eval * adapter mini swe agent * add swebench configs --------- Co-authored-by: zhanggaohua@huawei.com <GaoHua>1 个月前
【Feature】add SWE-bench example configs and bilingual user guide (#191) * util adapter * adapter swebench eval * adapter mini swe agent * add swebench configs --------- Co-authored-by: zhanggaohua@huawei.com <GaoHua>1 个月前
【Feature】add SWE-bench example configs and bilingual user guide (#191) * util adapter * adapter swebench eval * adapter mini swe agent * add swebench configs --------- Co-authored-by: zhanggaohua@huawei.com <GaoHua>1 个月前
【Feature】add SWE-bench example configs and bilingual user guide (#191) * util adapter * adapter swebench eval * adapter mini swe agent * add swebench configs --------- Co-authored-by: zhanggaohua@huawei.com <GaoHua>1 个月前
【Feature】add SWE-bench example configs and bilingual user guide (#191) * util adapter * adapter swebench eval * adapter mini swe agent * add swebench configs --------- Co-authored-by: zhanggaohua@huawei.com <GaoHua>1 个月前
README_en.md

SWEbench User Guide

SWE-bench is a benchmark for evaluating how well large language models solve real-world software issues collected from GitHub. Given a repository and an issue, the model is expected to generate a patch that fixes the described problem.

1. Feature Overview

ais_bench currently supports the following SWEbench capabilities:

  • Datasets: full, verified, lite, multilingual
  • Tasks:
    • infer: call mini-swe-agent to generate patches (model_patch)
    • eval: call the SWE-bench harness to run evaluation and count resolved instances
  • Result summary: output key metrics such as accuracy, submitted_accuracy, and resolved_instances

This directory already provides 4 example configs:

  • mini_swe_agent_swe_bench_lite.py
  • mini_swe_agent_swe_bench_verified.py
  • mini_swe_agent_swe_bench_full.py
  • mini_swe_agent_swe_bench_multilingual.py

2. Prerequisites

Before running, make sure the following dependencies are available:

  1. Install mini-swe-agent (required for infer)
pip install mini-swe-agent
  1. Install the SWE-bench harness (required for eval)
git clone https://github.com/SWE-bench/SWE-bench.git
cd SWE-bench
pip install -e .
cd -
  1. Docker is available (both infer and eval depend on containerized environments)
docker --version
docker ps

3. Minimal Configuration (Run First, Tune Later)

It is recommended to start from mini_swe_agent_swe_bench_lite.py and only modify the three fields in models[0]:

  • model: model name (required)
  • url: model service endpoint (OpenAI-compatible API)
  • api_key: service key (use EMPTY for local services)

Example (local vLLM setup):

models = [
    dict(
        attr="local",
        abbr="swebench",
        type="LiteLLMChat",
        model="qwen3",
        api_key="EMPTY",
        url="http://127.0.0.1:8000/v1",
        batch_size=1,
        generation_kwargs=dict(),
    )
]

Dataset Path Notes

In the example configs, path="" by default, which means online loading from Hugging Face is preferred.

  • You can keep path="" to fetch data online directly
  • For offline usage, change path to a local parquet file or directory (supports data/<split>-*.parquet)

First-Run Recommendations

  • Start with the lite dataset
  • Use batch_size=1
  • Keep step_limit=200 (default in examples; do not change initially)

4. Run Commands

Run the following in the repository root (config is the config file path):

ais_bench ais_bench/configs/swe_bench_examples/mini_swe_agent_swe_bench_lite.py

The command above runs the full pipeline (all). You can also run it step by step:

# Inference only, generate predictions
ais_bench ais_bench/configs/swe_bench_examples/mini_swe_agent_swe_bench_lite.py -m infer

# Evaluate based on existing predictions
ais_bench ais_bench/configs/swe_bench_examples/mini_swe_agent_swe_bench_lite.py -m eval

Resume from Checkpoint

Use --reuse to skip completed instances, which is useful after interruptions:

ais_bench ais_bench/configs/swe_bench_examples/mini_swe_agent_swe_bench_lite.py -m infer --reuse

5. How to Read Outputs

The default output directory is outputs/default/<timestamp>/. Focus on:

  • Inference outputs:
    • predictions/swebench/swebench_*.json
    • Each instance_id contains model_patch
  • Evaluation outputs:
    • results/swebench/swebench_*.json
    • Key fields:
      • accuracy: resolved_instances / total_instances
      • submitted_accuracy: resolved_instances / submitted_instances
      • resolved_instances / unresolved_instances / error_instances
      • harness_exit_code: harness exit code

6. Common Issues and Troubleshooting (SWEB Error Codes)

The following error codes come from SWEB_CODES. You can also refer to the full FAQ:

  • Chinese FAQ: docs/source_zh_cn/faqs/error_codes.md

1) SWEB-DEPENDENCY-001: Missing mini-swe-agent

  • Symptom: infer fails to start with dependency import errors
  • Cause: mini-swe-agent is not installed
  • Fix: run pip install mini-swe-agent

2) SWEB-DEPENDENCY-002: Missing SWE-bench harness

  • Symptom: harness import error during eval
  • Cause: SWE-bench is not installed, or not visible in the current environment
  • Fix: install SWE-bench as described in "Prerequisites", and make sure you are using the same Python environment

3) SWEB-PARAM-001: Empty model configuration

  • Symptom: prompt indicates model is not configured
  • Cause: models[0]['model'] is empty or only whitespace
  • Fix: configure model/url/api_key, and ensure model is non-empty at minimum

4) SWEB-DATA-002 / SWEB-FILE-003: Dataset loading failure

  • Symptom: online loading fails, or local parquet files cannot be found
  • Cause:
    • Online mode: network or Hugging Face access issues
    • Local mode: path does not exist, or directory layout does not match split parquet rules
  • Fix:
    • Switch to local parquet if online loading fails
    • Ensure local path follows: <root>/data/test-*.parquet or <root>/test-*.parquet

5) SWEB-FILE-001: Predictions file not found

  • Symptom: -m eval reports missing predictions
  • Cause: infer was not run first, or work_dir/reuse points to a different location
  • Fix: run -m infer first, and ensure eval and infer use the same config/output directory

6) SWEB-RUNTIME-001 / SWEB-RUNTIME-002: Container or harness runtime failure

  • Symptom: Docker image pull failure, or evaluation runtime errors
  • Cause: unavailable images, network issues, or insufficient container runtime environment
  • Fix:
    • Check docker ps first
    • Verify required images can be pulled
    • Retry with --reuse to avoid recomputing completed instances

7. Advanced Tips (Optional)

  • For initial debugging, use lite first, then switch to verified/full after the pipeline is stable
  • To reduce empty patches, prioritize improving model capability and agent prompt templates
  • During evaluation, focus on empty_patch_instances and error_instances; they are often more actionable than accuracy in early iterations