# SWE-bench
!!! abstract "Overview"
* We provide two scripts to run on the [SWE-bench](https://www.swebench.com/) benchmark.
* `mini-extra swebench` runs on all task instances in batch mode.
* `mini-extra swebench-single` runs on a single task instance with interactivity (useful for debugging).
* You can also look at the run scripts to see how to build your own batch-processing pipeline.
## Usage
!!! warning "Docker container availability"
The docker containers for Linux assume an x86 Linux architecture;
you might not be able to run them on other architectures.
!!! tip "Quickstart"
We provide two different scripts: `swebench` and `swebench-single`:
=== "Batch mode"
Batch mode runs on all task instances in parallel.
```bash
mini-extra swebench --help
# or
python src/minisweagent/run/benchmarks/swebench.py --help
# Example:
mini-extra swebench \
--model anthropic/claude-sonnet-4-5-20250929 \
--subset verified \
--split test \
--workers 4
```
Basic flags:
- `-o`, `--output` - Output directory
- `-m`, `--model` - Model to use
- `-c`, `--config` - Path to a config file (default: `swebench.yaml` in the `config` directory)
- `-w`, `--workers` - Number of worker threads for parallel processing (default: `1`)
Data selection flags:
- `--subset` - SWE-bench subset to use or path to a dataset (default: `lite`)
- `--split` - Dataset split (default: `dev`)
- `--slice` - Slice specification (e.g., '0:5' for first 5 instances)
- `--filter` - Filter instance IDs by regex
- `--shuffle` - Shuffle instances (default: `False`)
- `--redo-existing` - Redo existing instances (default: `False`)
Advanced flags:
- `--environment-class` - Environment type to use (recommended: `docker` or `singularity`)
=== "Single instance (for debugging)"
Single instance mode runs on a single task instance with interactivity. It is meant for debugging, so unlike the batch mode command above, it will not produce a `preds.json` file.
```bash
mini-extra swebench-single --help
# or
python src/minisweagent/run/benchmarks/swebench_single.py --help
# Example:
mini-extra swebench-single \
--subset verified \
--split test \
--model anthropic/claude-sonnet-4-5-20250929 \
-i sympy__sympy-15599
# or
mini-extra swebench-single \
--subset verified \
--split test \
-m anthropic/claude-sonnet-4-5-20250929 \
-i 0 # instance index
```
Note: If you want to run the script without prompting for confirmation at exit,
add the `--exit-immediately` flag.
Basic flags:
- `-m`, `--model` - Model to use
- `-c`, `--config` - Path to a config file (default: `swebench.yaml` in the `config` directory)
- `-o`, `--output` - Output trajectory file (default: saves to global config directory)
Data selection flags:
- `--subset` - SWE-bench subset to use or path to a dataset (default: `lite`)
- `--split` - Dataset split (default: `dev`)
- `-i`, `--instance` - SWE-Bench instance ID (default: `0`)
Advanced flags:
- `--environment-class` - Environment type to use (recommended: `docker` or `singularity`)
- `--exit-immediately` - Exit immediately when the agent wants to finish instead of prompting (default: `False`)
!!! tip "Evaluating on SWE-bench"
You have two options to evaluate on SWE-bench: Our free cloud-based evaluation or the SWE-bench CLI.
=== "Cloud-based evaluation"
You can use the [sb-cli](https://www.swebench.com/sb-cli/) for extremely fast, cloud-based evaluations
(and it's free!). After installing it and getting a token, simply run:
```bash
sb-cli submit swe-bench_verified test --predictions_path preds.json --run_id some-id-for-your-run
```
Typically you will have results within 20 minutes (this is not limited by how many instances you run,
but by the slowest-to-evaluate instance in SWE-bench).
=== "Local evaluation"
You can also use a local installation of [SWE-bench](https://github.com/SWE-bench/SWE-bench)
for evaluation:
```bash
python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Verified \
--predictions_path preds.json \
--max_workers 4 \
--run_id some-id-for-your-run
```
## FAQ
> Can I set global cost limits?
Yes, you can set global cost limits with the `MSWEA_GLOBAL_CALL_LIMIT` and `MSWEA_GLOBAL_COST_LIMIT` environment variables/global config.
See [global configuration](../advanced/global_configuration.md) for more details.
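For example, the limits can be set as environment variables before starting a batch run (the values below are illustrative, not recommendations):

```shell
# Hypothetical limits: cap total model spend at $10 and total LM calls
# at 500 across all workers of the batch run.
export MSWEA_GLOBAL_COST_LIMIT=10
export MSWEA_GLOBAL_CALL_LIMIT=500
# mini-extra swebench --subset lite --split dev --workers 4
```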
> What happens to uncompleted tasks when I abort with KeyboardInterrupt?
Trajectories are only saved upon completion, so in most cases you can simply rerun the script to finish the remaining tasks.
However, you should still check `preds.json` for entries containing `KeyboardInterrupt`, in case some tasks were aborted but still saved.
> Certain tasks remain stuck even though I deleted the trajectories.
Completed instances are inferred from `preds.json`, not from the trajectory files. Remove the corresponding entries from that file.
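This cleanup can be scripted; the path and the dict-keyed-by-instance-ID layout below are assumptions based on the batch run script described above, so verify them against your own `preds.json`:

```python
import json
from pathlib import Path

def drop_instances(preds_path, stale):
    """Remove predictions for the given instance IDs so they are rerun."""
    preds = json.loads(preds_path.read_text())
    kept = {iid: entry for iid, entry in preds.items() if iid not in stale}
    preds_path.write_text(json.dumps(kept, indent=2))

# Example (path and IDs are illustrative):
# drop_instances(Path("results/preds.json"), {"sympy__sympy-15599"})
```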
> How can I run on a different dataset?
As long as it follows the SWE-bench format, you can use `--subset /path/to/your/dataset` to run on a custom dataset.
The dataset needs to be loadable as `datasets.load_dataset(path, split=split)`.
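As a sketch, a minimal instance record might look like this (the field list follows the SWE-bench schema but is abridged, and all values are invented; check the official dataset for the full set of required columns):

```python
import json
from pathlib import Path

# One task instance in SWE-bench style (abridged; real datasets also
# carry columns such as patch, test_patch, and version).
instance = {
    "instance_id": "myorg__myrepo-1",
    "repo": "myorg/myrepo",
    "base_commit": "0123abc",
    "problem_statement": "Pagination skips the last page; fix the off-by-one.",
}

# Write a small dataset file that `datasets` should be able to load,
# e.g. via datasets.load_dataset("json", data_files="my_dataset.json").
Path("my_dataset.json").write_text(json.dumps([instance]))
```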
> Some progress runners are stuck at 'initializing task' for a very long time / time out
They might be pulling docker images; once the images are cached, the run should start quickly the next time.
If you see timeouts because of `docker pull` operations, you might want to increase `environment.pull_timeout`
from the default of `120` (seconds).
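For example, in your config file (assuming the same `environment` section used for `environment_class` elsewhere in this document):

```yaml
environment:
  pull_timeout: 600  # seconds; default is 120
```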
> I have some docker issues
Try running the docker command manually to see what's going on (it should be printed out in the console).
Confirm that the container is running with `docker ps`, and that `docker exec -it <container_id> ls` gives you some output.
> Docker isn't available on my HPC cluster.
You can use the singularity/apptainer backend by setting `environment.environment_class` to `singularity`
in your [agent config file](../advanced/yaml_configuration.md)
or specify `--environment-class singularity` on the command line.
> Can I run a startup command in the environment?
Yes, you can use the `run.env_startup_command` config option to run a command in the environment before the agent starts.
For example:
```yaml
run:
env_startup_command: "apt-get update && apt-get install -y python3-pip"
```
The command is rendered with the instance variables as template variables using `jinja2`.
For example, you could use
```yaml
run:
env_startup_command: "git clone {{ repo_url }} . --force"
```
which might be particularly useful when running with environments like [`bubblewrap`](../reference/environments/bubblewrap.md).
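Under the hood this is plain `jinja2` template rendering; a minimal sketch of what happens (the `repo_url` column is hypothetical and must actually exist in your dataset):

```python
import jinja2

# Hypothetical instance row; real SWE-bench rows carry columns such as
# instance_id, repo, base_commit, and problem_statement.
instance = {"repo_url": "https://github.com/myorg/myrepo"}

command = jinja2.Template("git clone {{ repo_url }} . --force").render(**instance)
print(command)  # git clone https://github.com/myorg/myrepo . --force
```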
> What environment can I use for SWE-bench?
See [this guide](../advanced/environments.md) for more details.
## Implementation
??? note "Default config"
- [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/config/benchmarks/swebench.yaml)
```yaml
--8<-- "src/minisweagent/config/benchmarks/swebench.yaml"
```
??? note "`swebench.py` run script"
- [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/run/benchmarks/swebench.py)
- [API reference](../reference/run/swebench.md)
```python
--8<-- "src/minisweagent/run/benchmarks/swebench.py"
```
??? note "`swebench_single.py` run script"
- [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/run/benchmarks/swebench_single.py)
- [API reference](../reference/run/swebench_single.md)
```python
--8<-- "src/minisweagent/run/benchmarks/swebench_single.py"
```
{% include-markdown "../_footer.md" %}