Benchmark Cases Runner
Introduction
Benchmark Cases Runner is a batch execution tool for Throughput Optimizer. While the Throughput Optimizer evaluates a single hardware/configuration combination per invocation, Benchmark Cases Runner drives multiple cases sequentially from a CSV input file, aggregates all results into a single output CSV, and enables cross-hardware and cross-scenario performance comparison without writing custom scripts.
Module Location: This tool is located at
experimental/optix/run_throughput_optimizer_cases.pywithin the Optix experimental module.
Use Cases:
- Multi-Configuration Evaluation: Benchmark multiple device × card-count × model × input/output-length combinations in a single run
- Cross-Hardware Comparison: Compare optimal throughput across different device profiles under consistent SLO constraints
- Regression Tracking: Run a fixed set of benchmark cases and collect results into a standardized CSV for trend analysis
Key Features:
- CSV-driven case definition: each row specifies one benchmark case
- Sequential execution with per-case error isolation (one failure does not abort the batch)
- Incremental result writing with batch flush (crash-safe at batch granularity)
- Template CSV generation with multiple example rows for quick setup
- CSV validation mode to check configuration before running
- Unified result CSV with Decode/Prefill metrics aligned with Throughput Optimizer output
Quick Start
Generate a template CSV
python -m optix.run_throughput_optimizer_cases --write-template cases_template.csv
This creates a CSV file with all required columns and three example rows covering common scenarios (1-card agg, 8-card agg, 4-card disagg with MTP). Edit the file to define your benchmark cases.
Validate your CSV before running
python -m optix.run_throughput_optimizer_cases --validate-csv cases.csv
This reads and validates all rows in the CSV, printing a summary of each parsed case (device, model, limits, mode, etc.) without actually executing the optimizer. Use this to catch configuration errors before a long batch run.
Run benchmark cases from CSV
python -m optix.run_throughput_optimizer_cases --input-csv cases.csv --output-csv results.csv
Each row in cases.csv is executed sequentially. Results are written to results.csv after every batch of 10 cases, so partial results are preserved even if the run is interrupted. If a single case fails, the error is logged and execution continues with the next case; the failed case gets a row in the CSV with empty metric columns.
CSV Input Format
The input CSV must include a header row with column names matching the template. Each subsequent row defines one benchmark case. List-type fields (e.g., mtp_acceptance_rate, ep_sizes) use semicolon (;) as the in-cell delimiter.
Required Columns
| Column | Description | Example |
|---|---|---|
case_name |
Identifier for this case (auto-generated as row_N if empty) |
8card_agg_w8a8 |
device |
Device profile name (same values as --device in Throughput Optimizer) |
ATLAS_800_A3_752T_128G_DIE |
num_devices |
Number of devices | 8 |
model_id |
HuggingFace model ID or local path | Qwen/Qwen3-32B |
input_length |
Input token length | 3500 |
output_length |
Output token length | 1500 |
ttft_limits |
TTFT SLO limit in ms (empty = no constraint) | 2000 |
tpot_limits |
TPOT SLO limit in ms (defaults to 50.0 ms if empty) | 50 |
Optional Columns
| Column | Description | Default |
|---|---|---|
tp_sizes |
TP sizes to search (semicolon-separated) | None (default range) |
quantize_linear_action |
Linear quantization action | None (falls back to W8A8_DYNAMIC) |
quantize_attention_action |
Attention quantization action | None (falls back to DISABLED) |
ep_sizes |
EP sizes to search (semicolon-separated), aligns with --ep-sizes in Throughput Optimizer |
None |
num_mtp_tokens |
Number of MTP tokens | 0 |
mtp_acceptance_rate |
MTP acceptance rates (semicolon-separated) | 0.9;0.6;0.4;0.2 |
compile |
Enable torch.compile (true/false) |
false |
mode |
Run mode: agg or disagg |
agg |
max_prefill_tokens |
Maximum prefill tokens | 8192 |
batch_range |
Batch size range (semicolon-separated min;max) | None |
serving_cost |
Serving cost | 0 |
jobs |
Number of parallel jobs | 8 |
log_level |
Log verbosity level | info |
mxfp4_group_size |
Group size for MXFP4 quantization | 32 |
reserved_memory_gb |
Reserved device memory in GB | 0 |
compile_allow_graph_break |
Allow graph breaks in compile (true/false) |
false |
Warning
ttft_limits and tpot_limits accept at most one value per case. If multiple values are provided (e.g., 50;100), the tool raises an error. Split them into separate CSV rows instead.
All limit values use milliseconds (ms) as the unit, consistent with Throughput Optimizer.
Note
If any required column (device, num_devices, model_id, input_length, output_length) is missing from the CSV header, the tool raises an error immediately rather than silently using wrong defaults.
Example CSV
case_name,device,num_devices,model_id,input_length,output_length,ttft_limits,tpot_limits,tp_sizes,quantize_linear_action,quantize_attention_action,ep_sizes,num_mtp_tokens,mtp_acceptance_rate,compile,mode,max_prefill_tokens,batch_range,serving_cost,jobs,log_level,mxfp4_group_size,reserved_memory_gb,compile_allow_graph_break
8card_agg_w8a8,ATLAS_800_A3_752T_128G_DIE,8,Qwen/Qwen3-32B,3500,1500,,50,,W8A8_DYNAMIC,DISABLED,,0,,true,agg,8192,,0,8,info,32,0,false
4card_disagg_ep,ATLAS_800_A3_752T_128G_DIE,4,Qwen/Qwen3-32B,2000,500,,50,,W8A8_DYNAMIC,INT8,2,0,,true,disagg,8192,,0,8,info,32,0,false
1card_disagg_mtp,ATLAS_800_A3_752T_128G_DIE,1,Qwen/Qwen3-32B,16000,1000,,50,,W8A8_DYNAMIC,DISABLED,,3,0.9;0.6;0.4,true,disagg,16000,,0,8,info,32,0,false
This CSV defines three cases:
- 8-card aggregation with W8A8_DYNAMIC linear quantization and TPOT ≤ 50 ms
- 4-card disaggregated with W8A8_DYNAMIC + INT8 attention quantization and EP size 2
- 1-card disaggregated prefill with MTP (3 tokens) and compile enabled
Output CSV
The output CSV contains a header row, a reference row listing valid quantization options, and one result row per benchmark case. Key columns include:
| Column | Description |
|---|---|
Case_Name |
Case identifier |
Device Type |
Device profile name |
Number of Devices |
Number of devices |
Input Length / Output Length |
Token lengths |
Model |
Model identifier |
Decode_Linear Quant Type |
Linear quantization type for best decode config |
Decode_Attn Quant Type |
Attention quantization type for best decode config |
Decode_Use EP |
EP size used (shown when ep_size > 1) |
Decode_MTP Tokens |
Number of MTP tokens |
Decode_TPOT Target(ms) |
TPOT SLO target in ms |
Decode_Concurrency |
Best decode concurrency |
Decode_TPOT(ms) |
Best decode TPOT under SLO |
Decode_Total TPS |
Best decode total throughput |
Decode_TPS/Device |
Best decode per-device throughput |
Decode_Mem / Decode_Comm / Decode_Cube / Decode_Vec |
Performance breakdown percentages |
Decode_TP Size / Decode_PP Size / Decode_DP Size |
Best parallel config |
Prefill_Linear Quant Type |
Linear quantization type for best prefill config |
Prefill_Attn Quant Type |
Attention quantization type for best prefill config |
Prefill_Use EP |
EP size used (shown when ep_size > 1) |
Prefill_MTP Tokens |
Number of MTP tokens |
Prefill_TTFT Target(ms) |
TTFT SLO target in ms |
Prefill_Concurrency |
Best prefill concurrency |
Prefill_TTFT(ms) |
Best prefill TTFT under SLO |
Prefill_Total TPS |
Best prefill total throughput |
Prefill_TPS/Device |
Best prefill per-device throughput |
Prefill_Mem / Prefill_Comm / Prefill_Cube / Prefill_Vec |
Performance breakdown percentages |
Prefill_TP Size / Prefill_PP Size / Prefill_DP Size |
Best parallel config |
QuantizeLinearAction_options |
All valid linear quantization actions |
QuantizeAttentionAction_options |
All valid attention quantization actions |
Note
If a case has no prefill result (e.g., disagg decode-only mode), the prefill metric columns are left empty. If a case has no decode result, the decode metric columns are left empty. If no configuration satisfies the SLO, the corresponding metric columns are present but the best values reflect the closest attempt. If a case fails entirely (e.g., invalid model, device not found), the metric columns are empty and the error is printed to stderr; the case still gets a row in the CSV.
CLI Parameters
Options:
-h, --help show this help message and exit
--input-csv INPUT_CSV
Path to input CSV file with benchmark cases (one case per row).
Header must include columns from the template.
--write-template PATH
Write a template CSV with multiple example rows to PATH and exit.
--output-csv OUTPUT_CSV
Path to output CSV file for results. Defaults to
'benchmark_cases_results.csv' when --input-csv is used.
--test-conversion Run internal conversion test and exit.
--validate-csv PATH Validate the input CSV file at PATH and print a summary
of parsed cases without executing the optimizer.
Execution Behavior
- Sequential execution: Cases run one at a time in CSV row order.
- Per-case error isolation: If a single case fails (e.g., invalid model, device not found, runtime exception), the error is printed to stderr and the case gets a row with empty metrics in the output CSV. Execution continues with the next case.
- Batch flush: Results are flushed to disk every 10 cases. If the process is interrupted, results from completed batches are preserved.
- In-process execution: Each case calls
ParallelRunnerfromserving_cast.parallel_runnerin-process. Results fromParallelRunnerare processed usingOptimizerSummary._prepare_agg_disagg_results()for SLO filtering, which correctly handles disagg mode by applying only the relevant limit per phase (TTFT for prefill summaries, TPOT for decode summaries). Log level is configured vialogging.basicConfigbefore each case, consistent with Throughput Optimizer.
Relationship to Throughput Optimizer
Benchmark Cases Runner internally calls Throughput Optimizer's ParallelRunner for each case. The mapping from CSV columns to Throughput Optimizer arguments is:
| CSV Column | Throughput Optimizer Argument |
|---|---|
device |
--device |
num_devices |
--num-devices |
model_id |
positional model_id |
input_length |
--input-length |
output_length |
--output-length |
ttft_limits |
--ttft-limits |
tpot_limits |
--tpot-limits |
compile |
--compile |
mode |
--disagg (when set to disagg) |
quantize_linear_action |
--quantize-linear-action |
quantize_attention_action |
--quantize-attention-action |
tp_sizes |
--tp-sizes |
ep_sizes |
--ep-sizes |
num_mtp_tokens |
--num-mtp-tokens |
mtp_acceptance_rate |
--mtp-acceptance-rate |
max_prefill_tokens |
--max-prefill-tokens |
batch_range |
--batch-range |
serving_cost |
--serving-cost |
jobs |
--jobs |
log_level |
--log-level |
reserved_memory_gb |
--reserved-memory-gb |
compile_allow_graph_break |
--compile-allow-graph-break |
mxfp4_group_size |
--mxfp4-group-size |
The following Throughput Optimizer arguments are set to defaults since they are not exposed as CSV columns:
| Argument | Default Value | Reason |
|---|---|---|
image_batch_size |
None |
Multimodal not supported in CSV |
image_height |
None |
Multimodal not supported in CSV |
image_width |
None |
Multimodal not supported in CSV |
prefix_cache_hit_rate |
0.0 |
Not commonly varied per case |
enable_multistream |
True |
Default from Throughput Optimizer |
enable_optimize_prefill_decode_ratio |
False |
PD ratio optimization not exposed |
prefill_devices_per_instance |
None |
PD ratio optimization not exposed |
decode_devices_per_instance |
None |
PD ratio optimization not exposed |
moe_dp_sizes |
None |
MOE-DP search not exposed |
For detailed parameter descriptions, see the Throughput Optimizer documentation.