RFC: Standalone Performance Database Collection Tooling
Metadata
| Item | Content |
|---|---|
| Status | Complete |
| Author(s) | Horacehxw, Codex |
| Creation Date | 2026-05-12 |
| Related Links | https://gitcode.com/Ascend/msmodeling/pull/124 |
1. Overview
The performance database can only be trusted by runtime performance models when its CSV data, operator mappings, replay scripts, communication benchmarks, and validation reports come from a repeatable offline flow. This RFC defines tools/perf_data_collection/ as the offline data-production toolchain, covering raw profiling parsing, FIA runtime metadata backfill, theory shape-grid generation, per-kernel replay, msprof writeback, HCCL communication benchmarking, and basic trace conversion.
The produced database is consumed by EmpiricalPerformanceModel / ProfilingDataSource: runtime only needs to read op_mapping.yaml, per-kernel compute CSVs, and optional HCCL communication CSVs. This RFC fully defines how those data artifacts are generated, validated, and promoted into a reviewable performance database, without requiring any other design document to understand the database generation flow.
This RFC defines the standalone collection and tooling boundary for the performance database: how raw NPU profiling results become versioned CSV data, how synthetic shapes and microbench measurements are added, how HCCL communication data is collected, and how data is validated before runtime query code consumes it.
1.1 Goals
- Provide an offline data-production contract that does not depend on temporary design or planning documents.
- Define the database directory structure, compute CSV schema, FIA-specific columns, communication CSV schema, and versioning rules.
- Clarify the boundary between op_mapping.yaml, raw profiling parsing, shape-grid generation, operator replay scripts, msprof orchestration, and CSV writeback.
- Record tooling boundaries and limitations so contributors know which steps are automated and which steps remain manually maintained.
- Give human reviewers a shared checklist for deciding whether CSV/YAML/report changes can enter a production database directory.
- Provide a rollout and test plan where most checks run without NPU hardware, while replay/HCCL checks remain isolated to hardware hosts.
1.2 Non-goals
- This RFC does not change ProfilingDataSource runtime query behavior or interpolation logic.
- This RFC does not define model-level alignment metrics or end-to-end simulation strategy; it only requires the produced data to be consumable by those systems.
- This RFC does not require an external database service, remote artifact repository, or online collection pipeline.
- This RFC does not claim replay coverage for every profiling kernel.
- This RFC does not make temporary planning documents stable interfaces.
2. Solution Design
2.1 Recommended Solution
Define the standard offline collection pipeline as:
flowchart TD
A[Raw NPU profiling bundle] --> B[parsers/parse_kernel_details.py]
B --> C[Per-kernel compute CSVs]
C --> D[fill_fia_runtime_metadata.py]
C --> E[generate_shape_grid.py]
D --> E
E --> F[op_replay/*_run.py]
F --> G[start_microbench.py under msprof]
G --> H[CSV writeback and reports]
I[HCCL benchmark] --> J[hccl/{cann_version}/hcom_*.csv]
J --> H
K[op_mapping.yaml] --> F
K --> H
H --> L[Versioned profiling_database/data directory]
The flow remains file-based. CSV and YAML are reviewable and diffable, and they remain compatible with the runtime data source. Tools may generate or rewrite rows, but persistent data artifacts stay in versioned database directories.
2.1.0 Tool Scope
This RFC defines the entry points and responsibilities under tools/perf_data_collection/:
| Scope | Entry point | Database responsibility |
|---|---|---|
| Raw profiling parsing | parsers/parse_kernel_details.py | Generate per-kernel compute CSVs from kernel_details*.csv or a profiling directory |
| FIA runtime metadata backfill | fill_fia_runtime_metadata.py | Merge FIA JSONL runtime values into FusedInferAttentionScore.csv |
| Shape-grid expansion | generate_shape_grid.py, grid_generator/, memory_estimator.py | Append replayable theory shape rows and filter them by HBM budget |
| Compute operator replay | op_replay/*_run.py, op_replay/run_all_op.py, start_microbench.py | Replay CSV rows on an NPU host and write microbench latency back |
| HCCL communication collection | comm_bench/generate_comm_microbench.py, comm_bench/run_comm_bench.sh, comm_bench/validate_comm_alignment.py | Generate or validate hccl/{cann_version}/hcom_*.csv |
| Trace conversion support | parsers/trace_to_csv.py | Convert TensorCast Chrome traces to CSV for offline alignment and reports; it does not directly produce operator database CSVs |
2.1.1 Database Directory Contract
The database has two persistent data classes:
| Data type | Directory | Content | Version scope |
|---|---|---|---|
| Compute/operator data | tensor_cast/performance_model/profiling_database/data/{device}/vllm_ascend/{version_dir}/ | op_mapping.yaml and {KernelType}.csv | Device + vLLM-Ascend + PyTorch + CANN stack |
| Communication data | tensor_cast/performance_model/profiling_database/data/{device}/hccl/{cann_version}/ | hcom_*.csv benchmark files, with optional topology/config metadata | Device + CANN/HCCL stack |
{version_dir} follows the helper convention:
vllm{vllm_version}_torch{torch_version}_cann{cann_version}
Recommended directory examples include:
- vllm0.13.0_torch2.8.0_cann8.3
- vllm0.15.0_torch2.9.0_cann8.5
- vllm0.18.0_torch2.9.0_cann8.5
- vllm0.18.0_torch2.9.0_cann8.5_shape_generated, a generated/staging example that should not be mistaken for a complete measured directory.
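The naming convention can be illustrated with a minimal sketch. The function names `build_version_dir` and `parse_version_dir` are hypothetical and are not the helper shipped in the repository; the regular expression also assumes that version strings contain only digits and dots.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for vllm{vllm_version}_torch{torch_version}_cann{cann_version},
# with an optional trailing suffix such as "shape_generated".
_VERSION_DIR_RE = re.compile(
    r"^vllm(?P<vllm>[0-9.]+)_torch(?P<torch>[0-9.]+)_cann(?P<cann>[0-9.]+)"
    r"(?:_(?P<suffix>.+))?$"
)


def build_version_dir(vllm_version: str, torch_version: str,
                      cann_version: str, suffix: str = "") -> str:
    """Compose a compute directory name from the software-stack versions."""
    name = f"vllm{vllm_version}_torch{torch_version}_cann{cann_version}"
    return f"{name}_{suffix}" if suffix else name


def parse_version_dir(name: str) -> Dict[str, Optional[str]]:
    """Split a directory name back into its version components."""
    match = _VERSION_DIR_RE.match(name)
    if match is None:
        raise ValueError(f"not a version directory name: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(build_version_dir("0.13.0", "2.8.0", "8.3"))
    print(parse_version_dir("vllm0.18.0_torch2.9.0_cann8.5_shape_generated"))
```

Parsing the name back is mainly useful for review tooling that wants to flag generated or staging directories by their suffix.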
Each compute directory owns one op_mapping.yaml for that stack. Communication CSVs are not copied into every compute directory; op_mapping.yaml points to the communication directory through communication_data_ref, for example:
communication_data_ref: "../../hccl/v8.5/"
communication_fallback: analytic
This allows multiple vLLM-Ascend versions to reuse HCCL measurements when the device and CANN/HCCL stack are unchanged.
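As a rough illustration of how the relative reference resolves, the sketch below joins communication_data_ref onto a compute directory. The function name `resolve_comm_dir` and the device name `my_device` are placeholders, not identifiers from the repository.

```python
import posixpath


def resolve_comm_dir(compute_dir: str, communication_data_ref: str) -> str:
    """Resolve communication_data_ref relative to the directory that holds op_mapping.yaml."""
    return posixpath.normpath(posixpath.join(compute_dir, communication_data_ref))


if __name__ == "__main__":
    # "my_device" is a placeholder device directory.
    compute_dir = (
        "tensor_cast/performance_model/profiling_database/data/"
        "my_device/vllm_ascend/vllm0.18.0_torch2.9.0_cann8.5"
    )
    print(resolve_comm_dir(compute_dir, "../../hccl/v8.5/"))
```

Two `..` segments climb from `{version_dir}` through `vllm_ascend` to the device directory, which is why sibling compute directories for the same device can share one HCCL directory.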
2.1.2 Compute CSV Contract
Each compute CSV is named from the NPU profiling Type column after necessary normalization:
{KernelType}.csv
The parser and runtime data source use a CSV contract based on shapes and metadata rather than a Python object schema. The base required columns are:
| Column group | Columns | Producer |
|---|---|---|
| Kernel identity and shape | OP State, Accelerator Core, Input Shapes, Input Data Types, Input Formats, Output Shapes, Output Data Types, Output Formats | Raw profiling parser |
| Profiling latency | Profiling Average Duration(us), Profiling Median Duration(us), Profiling Std Duration(us) | Raw profiling parser |
| Profiling counters | Profiling Average aicore_time(us), Profiling Average aic_total_cycles, Profiling Average aic_mac_time(us), and related AIC/AIV utilization columns | Raw profiling parser |
| Microbench latency | Average Duration(us) | start_microbench.py writeback |
| Microbench counters | MicroBench aicore_time(us), MicroBench aic_total_cycles, and corresponding MicroBench ... counter columns | start_microbench.py writeback |
Shape fields use semicolons to separate tensor slots and commas to separate dimensions:
"136,7168;7168,3584"
Key rules:
- Slot count has semantics. Empty slots must be preserved for FIA, custom kernels, and operators with optional inputs or scalar parameters.
- Input Data Types and Input Formats align with Input Shapes by slot index.
- FRACTAL_NZ is a valid format; replay and runtime lookup normalize it when needed.
- Generated rows may initially have no latency or a zero latency. A row should be considered measured only after replay/writeback fills Average Duration(us) or profiling import fills Profiling Average Duration(us).
- Runtime lookup selects latency columns in this order: Average Duration(us), Profiling Average Duration(us), then Duration(us).
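The slot and latency rules above can be sketched as follows. This is an illustrative reading of the contract, not the ProfilingDataSource implementation: the function names are hypothetical, and treating blank or zero values as "not measured" is an assumption based on the rule about generated rows.

```python
import math
from typing import Dict, List, Optional

# Latency-column priority as stated in the contract.
LATENCY_PRIORITY = (
    "Average Duration(us)",
    "Profiling Average Duration(us)",
    "Duration(us)",
)


def parse_shape_field(field: str) -> List[Optional[List[int]]]:
    """Split on ';' into tensor slots and on ',' into dimensions.

    Empty slots are kept as None so that the slot count is preserved.
    """
    slots: List[Optional[List[int]]] = []
    for slot in field.split(";"):
        slot = slot.strip()
        slots.append([int(dim) for dim in slot.split(",")] if slot else None)
    return slots


def pick_latency(row: Dict[str, str]) -> Optional[float]:
    """Return the first usable latency following the column priority.

    Assumption: blank, non-numeric, non-finite, and non-positive values
    are treated as missing.
    """
    for column in LATENCY_PRIORITY:
        raw = (row.get(column) or "").strip()
        if not raw:
            continue
        try:
            value = float(raw)
        except ValueError:
            continue
        if math.isfinite(value) and value > 0:
            return value
    return None


if __name__ == "__main__":
    print(parse_shape_field("136,7168;7168,3584"))
    print(parse_shape_field("8,128;;8,128"))  # middle slot stays empty
    print(pick_latency({"Average Duration(us)": "0",
                        "Profiling Average Duration(us)": "12.5"}))
```

Keeping empty slots as explicit None entries keeps Input Data Types and Input Formats aligned with Input Shapes by index.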
2.1.3 FIA CSV Extension Contract
FusedInferAttentionScore.csv needs additional runtime metadata because shape alone cannot fully describe paged attention behavior. The parser and backfill tool use the following columns when available:
| Column | Meaning |
|---|---|
| Runtime source_profile | Source profiling subdirectory or source tag |
| Runtime actual_seq_lengths_shape / Runtime actual_seq_lengths_values | Query sequence-length metadata |
| Runtime actual_seq_lengths_kv_shape / Runtime actual_seq_lengths_kv_values | KV sequence-length metadata |
| Runtime avg_seq_len | Average KV sequence length used by attention lookup and interpolation |
| Runtime block_table_shape / Runtime block_table_valid_blocks | Paged cache block-table metadata |
| Runtime num_heads / Runtime num_key_value_heads | Runtime head configuration |
| Runtime sparse_mode, Runtime input_layout, Runtime block_size | FIA execution-mode fields |
| Runtime attn_state, Runtime kv_cache_mode | Optional state/cache-mode fields |
| Runtime metadata_completeness | Completeness tag, for example profile_shapes_only, runtime_values, runtime_values_dumped, or shape_grid_scene_generated |
The parser can recover FIA shape-level metadata from profiling bundles. Full runtime values require a JSONL dump plus fill_fia_runtime_metadata.py. When those values are missing, the replay script falls back to heuristic inference; such rows remain useful, but they should not be treated as fully equivalent reproductions of whole-model runtime state.
2.1.4 Communication CSV Contract
Communication data is stored under the HCCL directory, one CSV per HCCL kernel. Standard filenames are:
- hcom_allGather_.csv
- hcom_allReduce_.csv
- hcom_alltoallv_.csv
- hcom_reduceScatter_.csv
Communication CSV columns are split between runtime-required columns and collection-standard columns. ProfilingDataSource runtime lookup requires message_bytes, num_devices, and one usable latency column. topology_tier participates in matching only when the CSV contains it and DeviceProfile.comm_grid can resolve it. dtype and bandwidth_gbps are collection-standard columns for audit and validation; lookup does not filter by dtype. communication_fallback records the intended fallback policy; actual MISS fallback is handled by the upper empirical model.
| Column | Level | Type | Meaning |
|---|---|---|---|
| message_bytes | Runtime-required | int | Per-rank message size used for lookup and interpolation |
| num_devices | Runtime-required | int | Number of ranks in the communication group |
| Average Duration(us) / Profiling Average Duration(us) / Duration(us) | One runtime-required latency column | float | Communication latency; runtime selects by the common latency-column priority |
| topology_tier | Conditional match | int | Device topology tier; used when topology can be resolved |
| dtype | Collection-standard | string | Measured dtype, such as DT_BF16 or INT8 |
| bandwidth_gbps | Collection-standard | float | Derived bandwidth for audit and validation |
ProfilingDataSource first searches for a communication row that exactly matches message_bytes and num_devices; when topology_tier can be resolved and the CSV contains that column, topology_tier must also match. On miss, it finds bracket rows within the same num_devices and available topology tier, fits an alpha-beta latency model from those matching rows, and clamps the predicted latency to the bracket latency bounds. If DeviceProfile.comm_grid is absent, topology filtering is skipped; in that case, the CSV should avoid ambiguous rows for the same message/device count.
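The lookup order described above can be sketched as below. This is a simplified model under stated assumptions, not the ProfilingDataSource code: `CommRow`, `fit_alpha_beta`, and `lookup_latency` are hypothetical names, the alpha-beta fit is an ordinary least-squares line latency = alpha + beta * message_bytes over all matching rows, and a query outside the measured range is reported as a miss.

```python
from typing import List, NamedTuple, Optional, Tuple


class CommRow(NamedTuple):
    message_bytes: int
    num_devices: int
    latency_us: float
    topology_tier: Optional[int] = None  # None when the CSV lacks the column


def fit_alpha_beta(rows: List[CommRow]) -> Tuple[float, float]:
    """Least-squares fit of latency = alpha + beta * message_bytes."""
    n = len(rows)
    mean_x = sum(r.message_bytes for r in rows) / n
    mean_y = sum(r.latency_us for r in rows) / n
    var_x = sum((r.message_bytes - mean_x) ** 2 for r in rows)
    if var_x == 0:
        return mean_y, 0.0
    cov = sum((r.message_bytes - mean_x) * (r.latency_us - mean_y) for r in rows)
    beta = cov / var_x
    return mean_y - beta * mean_x, beta


def lookup_latency(rows: List[CommRow], message_bytes: int, num_devices: int,
                   topology_tier: Optional[int] = None) -> Optional[float]:
    # Topology filtering applies only when both the query and the row carry a tier.
    pool = [
        r for r in rows
        if r.num_devices == num_devices
        and (topology_tier is None or r.topology_tier is None
             or r.topology_tier == topology_tier)
    ]
    for r in pool:  # 1) exact match
        if r.message_bytes == message_bytes:
            return r.latency_us
    # 2) bracket rows on both sides of the requested size
    lower = max((r for r in pool if r.message_bytes < message_bytes),
                key=lambda r: r.message_bytes, default=None)
    upper = min((r for r in pool if r.message_bytes > message_bytes),
                key=lambda r: r.message_bytes, default=None)
    if lower is None or upper is None:
        return None  # assumption: no extrapolation, report a miss
    # 3) alpha-beta prediction, clamped to the bracket latency bounds
    alpha, beta = fit_alpha_beta(pool)
    low, high = sorted((lower.latency_us, upper.latency_us))
    return min(max(alpha + beta * message_bytes, low), high)


if __name__ == "__main__":
    table = [CommRow(1024, 8, 10.0), CommRow(4096, 8, 16.0), CommRow(16384, 8, 40.0)]
    print(lookup_latency(table, 2048, 8))
```

The clamp matters when the fitted line is skewed by rows far from the bracket: the prediction can never leave the latency range of the two neighbouring measurements.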
2.1.5 Relationship Between op_mapping.yaml and Tooling
op_mapping.yaml is both the runtime query contract and a tooling input. It maps TensorCast operators to NPU profiling kernel types and records the evidence chain for each mapping.
Important top-level fields:
| Field | Purpose |
|---|---|
| version, device, cann_version, pytorch_version, op_plugin_version, collection_date | Identify the software stack and collection batch |
| communication_data_ref | Relative path to HCCL communication CSV data |
| communication_fallback | Intended fallback policy; actual MISS fallback is handled by the upper empirical model |
| interpolation_policy | Optional interpolation behavior, such as sqrt transform for FIA |
| operator_mappings | Runtime mapping from TensorCast op name to kernel query rule |
| torch_npu_reference | Reference API metadata that can help create replay scripts |
Important operator_mappings fields:
| Field | Meaning |
|---|---|
| kernel_type | Primary {KernelType}.csv file stem |
| alternate_kernel_types | Candidate CSVs for version drift or kernel variants |
| category: communication | Route the query to communication lookup |
| query_mode | Use special query logic, such as attention_special, elementwise, or moe_fused |
| composite, sub_kernels, decomposer | Model one TensorCast op as multiple NPU kernels |
| tc_input_count | Truncate TensorCast inputs for matching when TC and NPU signatures differ |
| zero_cost | Mark shape-only or fusion-absorbed ops as measured zero latency |
| accepted_miss | Record expected misses so they do not block validation |
| notes | Evidence chain and review context |
The first phase does not require automatically consuming torch_npu_reference.{KernelType}.microbench_api to generate op_replay/<KernelType>_run.py. Replay scripts are maintained manually. When adding or updating a replay script, contributors should use the Microbench Run Script Generator workflow in docs/perf_database/skills/microbench/SKILL.md: generate or update the script from the operator CSV, op_mapping.yaml microbench_api, upstream interface docs/tests, and the local replay conventions. A future generator can build on this workflow, but this RFC treats torch_npu_reference as auxiliary metadata rather than first-phase automation.
2.1.6 Raw Profiling Parser
tools/perf_data_collection/parsers/parse_kernel_details.py is the entry point for converting Ascend profiling output into per-kernel CSV files.
Parser behavior:
- --profiling-path may be a single kernel_details*.csv file or a profiling directory.
- Directory input is scanned recursively for CSV filenames containing kernel_details.
- operator_details.csv and trace_view.json are discovered for FIA bundle inspection and later enrichment.
- Kernel details must contain Type, OP State, Accelerator Core, shape/dtype/format columns, Duration(us), and AIC/AIV counter columns.
- Known variant names are normalized, including split_qkv_rmsnorm_rope_kernel_0 to split_qkv_rmsnorm_rope_kernel, and muls_add_kernel_1 to muls_add_kernel.
- Rows are aggregated by (normalized Type, Input Shapes, Output Shapes).
- Output includes average, median, and standard-deviation latency columns, plus averaged hardware counters.
- Each operator type is written as {KernelType}.csv under the target database directory.
The parser is conservative: it does not infer missing tensor slots and does not rewrite API semantics. It preserves the profiling representation and leaves shape/API alignment to op_mapping.yaml, generate_shape_grid.py, and replay scripts.
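A minimal sketch of the normalize-then-aggregate step is shown below. It is not the code in parse_kernel_details.py: `KNOWN_VARIANTS` only lists the two renames named above, the function names are hypothetical, hardware counters are omitted, and the use of population standard deviation is an assumption.

```python
import statistics
from collections import defaultdict
from typing import Dict, List, Tuple

# Only the two variant renames mentioned in this RFC; the real table may be longer.
KNOWN_VARIANTS = {
    "split_qkv_rmsnorm_rope_kernel_0": "split_qkv_rmsnorm_rope_kernel",
    "muls_add_kernel_1": "muls_add_kernel",
}


def normalize_type(kernel_type: str) -> str:
    """Map a known variant name to its canonical kernel type."""
    return KNOWN_VARIANTS.get(kernel_type, kernel_type)


def aggregate(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Group by (normalized Type, Input Shapes, Output Shapes) and summarize Duration(us)."""
    groups: Dict[Tuple[str, str, str], List[float]] = defaultdict(list)
    for row in rows:
        key = (normalize_type(row["Type"]), row["Input Shapes"], row["Output Shapes"])
        groups[key].append(float(row["Duration(us)"]))

    summary = []
    for (kernel_type, in_shapes, out_shapes), durations in groups.items():
        summary.append({
            "Type": kernel_type,
            "Input Shapes": in_shapes,
            "Output Shapes": out_shapes,
            "Profiling Average Duration(us)": statistics.fmean(durations),
            "Profiling Median Duration(us)": statistics.median(durations),
            # Assumption: population standard deviation (0.0 for a single sample).
            "Profiling Std Duration(us)": statistics.pstdev(durations),
        })
    return summary


if __name__ == "__main__":
    raw = [
        {"Type": "muls_add_kernel_1", "Input Shapes": "8,128",
         "Output Shapes": "8,128", "Duration(us)": "10"},
        {"Type": "muls_add_kernel", "Input Shapes": "8,128",
         "Output Shapes": "8,128", "Duration(us)": "14"},
    ]
    print(aggregate(raw))
```

Because the shape strings are used verbatim as part of the key, the aggregation never reorders or drops tensor slots, which matches the conservative behavior described above.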
2.1.7 FIA Runtime Metadata Enrichment
FIA enrichment has two layers:
- parsers/parse_kernel_details.py can attach shape-level metadata from a profiling bundle and mark rows as Runtime metadata_completeness=profile_shapes_only.
- tools/perf_data_collection/fill_fia_runtime_metadata.py can merge runtime JSONL dumps into FusedInferAttentionScore.csv.
JSONL backfill uses LEFT JOIN semantics:
- CSV key: query/key/value shapes, KV sequence item count, attention mask shape, and block-table shape.
- JSONL key: corresponding runtime tensor metadata.
- If multiple runtime records match one CSV row, the row expands to 1:N.
- Matched rows receive a configurable metadata tag; the default is runtime_values_dumped.
The minimum JSONL join fields are query_shape, key_shape, value_shape, actual_seq_lengths_kv, atten_mask_shape, and block_table_shape. Optional payload fields include actual_seq_lengths, block_table_valid_blocks, num_heads, num_key_value_heads, sparse_mode, input_layout, and block_size. Each line is one JSON object, for example:
{"query_shape":[8192,16,128],"key_shape":[8192,16,128],"value_shape":[8192,16,128],"actual_seq_lengths_kv":[40,998],"atten_mask_shape":null,"block_table_shape":null,"num_heads":16,"num_key_value_heads":2,"sparse_mode":0,"input_layout":"TND","block_size":128}
This design makes metadata incompleteness explicit. Shape-only FIA rows remain usable, but validation reports should treat profile_shapes_only rows as lower confidence than rows with real runtime sequence values.
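The LEFT JOIN with 1:N expansion can be sketched as follows. This is an illustration only: `jsonl_key` and `left_join` are hypothetical names, the CSV-side key is supplied by the caller because the FIA slot layout is not restated here, and serializing list payloads with `json.dumps` is an assumption about the column format.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, Hashable, List

# JSONL payload field -> CSV column, limited to columns named in this RFC.
PAYLOAD_COLUMNS = {
    "actual_seq_lengths_kv": "Runtime actual_seq_lengths_kv_values",
    "num_heads": "Runtime num_heads",
    "num_key_value_heads": "Runtime num_key_value_heads",
    "sparse_mode": "Runtime sparse_mode",
    "input_layout": "Runtime input_layout",
    "block_size": "Runtime block_size",
}


def _as_tuple(value):
    return tuple(value) if value is not None else None


def jsonl_key(record: Dict) -> Hashable:
    """Join key: q/k/v shapes, KV sequence item count, mask shape, block-table shape."""
    return (
        _as_tuple(record.get("query_shape")),
        _as_tuple(record.get("key_shape")),
        _as_tuple(record.get("value_shape")),
        len(record.get("actual_seq_lengths_kv") or []),
        _as_tuple(record.get("atten_mask_shape")),
        _as_tuple(record.get("block_table_shape")),
    )


def left_join(csv_rows: List[Dict], records: List[Dict],
              csv_key: Callable[[Dict], Hashable],
              tag: str = "runtime_values_dumped") -> List[Dict]:
    """Keep every CSV row; expand a row once per matching runtime record."""
    index = defaultdict(list)
    for record in records:
        index[jsonl_key(record)].append(record)

    joined = []
    for row in csv_rows:
        matches = index.get(csv_key(row), [])
        if not matches:
            joined.append(dict(row))  # unmatched rows survive unchanged
            continue
        for record in matches:  # 1:N expansion
            enriched = dict(row)
            for field, column in PAYLOAD_COLUMNS.items():
                if field in record:
                    value = record[field]
                    enriched[column] = json.dumps(value) if isinstance(value, list) else value
            enriched["Runtime metadata_completeness"] = tag
            joined.append(enriched)
    return joined
```

Since unmatched rows pass through untouched, they keep whatever completeness tag the parser assigned, such as profile_shapes_only.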
2.1.8 Shape Grid Generation
tools/perf_data_collection/generate_shape_grid.py appends theory-generated rows to database CSV files. It does not replace real profiling data and does not produce measured latency by itself.
Shape-grid behavior:
- grid_generator/config.yaml routes kernel types to theory patterns.
- Template patterns cover GEMM, quantized matmul, elementwise, norm, quantization, RoPE, sampling, KV cache, and shape-manipulation categories.
- Fused attention, grouped matmul, and DispatchFFNCombine use more complex Python generators.
- --target-models uses known model configurations to prune GEMM (N, K) candidates.
- --rows caps rows per CSV; --seed makes sampling reproducible.
- --max-hbm-gb filters generated rows whose estimated input/output tensors exceed the memory budget.
- Files without a matching theory generator are skipped.
Generated rows should inherit stable structural metadata from source rows and leave performance values blank or zero. They become production data only after start_microbench.py or a later profiling import fills latency.
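The HBM filter and seeded row cap can be approximated with the sketch below. It is not memory_estimator.py: the dtype size table is illustrative, 1 GB is taken as 1024**3 bytes, and the function names are hypothetical.

```python
import math
import random
from typing import Dict, List

# Illustrative dtype sizes in bytes; the real estimator may know more dtypes.
DTYPE_BYTES = {"DT_BF16": 2, "DT_FLOAT16": 2, "DT_FLOAT": 4, "INT8": 1}


def estimate_bytes(shapes: str, dtypes: str) -> int:
    """Sum tensor sizes over slots; empty slots contribute nothing."""
    total = 0
    for shape, dtype in zip(shapes.split(";"), dtypes.split(";")):
        if not shape.strip():
            continue
        dims = [int(dim) for dim in shape.split(",")]
        total += math.prod(dims) * DTYPE_BYTES[dtype.strip()]
    return total


def filter_and_sample(rows: List[Dict[str, str]], max_hbm_gb: float,
                      max_rows: int, seed: int) -> List[Dict[str, str]]:
    """Drop rows over the memory budget, then cap the row count reproducibly."""
    budget = max_hbm_gb * 1024 ** 3  # assumption: GB means GiB here
    kept = [
        row for row in rows
        if estimate_bytes(row["Input Shapes"], row["Input Data Types"])
        + estimate_bytes(row["Output Shapes"], row["Output Data Types"]) <= budget
    ]
    if len(kept) <= max_rows:
        return kept
    # A private Random instance keeps the sample stable for a given seed.
    return random.Random(seed).sample(kept, max_rows)


if __name__ == "__main__":
    print(estimate_bytes("136,7168;7168,3584", "DT_BF16;DT_BF16"))
```

Using a dedicated `random.Random(seed)` rather than the module-level generator keeps repeated runs comparable even when other code also draws random numbers.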
2.1.9 Per-kernel op_replay Script Framework
tools/perf_data_collection/op_replay/ stores per-kernel replay implementations. Each covered kernel should provide a matching *_run.py script for categories such as matmul, quantization, FIA, KV cache, softmax, sort, transpose, and vLLM-Ascend custom kernels.
Shared conventions:
- common.py handles database path resolution, software-stack version directory naming, dtype/format parsing, tensor construction, repeat-count handling, and invalid-row tracking.
- replay_framework.py provides an OpReplay helper for standard API-style kernels.
- run_all_op.py discovers *_run.py, supports --execution-mode inprocess for one outer msprof session, and writes run_all_op_status.json.
- Each operator script reads the matching {KernelType}.csv, replays each row on NPU, prints concise [OK] messages, and deletes invalid rows when replay case construction fails.
- When --repeat-count is omitted, MSMODELING_OP_REPLAY_REPEAT_COUNT can provide the default repeat count.
- When adding or modifying <KernelType>_run.py, contributors should follow the Microbench Run Script Generator skill's core steps: read the target CSV, locate torch_npu_reference.<KernelType>.microbench_api, confirm the real API from upstream repo docs/tests, infer missing non-tensor arguments, reuse common.py/replay_framework.py conventions, and validate at least py_compile plus --help.
Coverage limitation: replay coverage can expand incrementally. A CSV may exist without a corresponding replay script, and some scripts only support their target recorded shape/API patterns. Custom operators also require correct ASCEND_CUSTOM_OPP_PATH and LD_LIBRARY_PATH. Regular CI coverage for NPU-dependent replay is necessarily limited; most tests cover imports, CLI help, pure parsing logic, and writeback units.
2.1.10 start_microbench.py msprof Orchestration and Writeback
tools/perf_data_collection/start_microbench.py is the production entry point for replay measurement writeback.
Flow:
- Resolve the target database directory through --database-path or device/version parameters.
- Validate selected operators and required custom OPP environment variables.
- Run msprof --output=... python op_replay/run_all_op.py --execution-mode inprocess.
- Locate generated PROF_* directories and op_summary_*.csv files.
- Aggregate rows by operator type and signature.
- Write Average Duration(us) and MicroBench ... counter columns back to matching database CSV rows.
- Generate a markdown update report and duration-gap CSV under reports/.
Row matching uses:
Input Shapes
Input Data Types
Input Formats
Output Shapes
Output Data Types
DispatchFFNCombine also includes EP Size in the signature. Update modes:
| Mode | Behavior |
|---|---|
| all | Update all matching rows and append unmatched profiling samples |
| missing-only | Replay/fill only rows whose microbench and profiling latencies are invalid; unmatched samples are reported but not appended |
--prof-path can parse an existing PROF_* directory without launching msprof. --prune-empty-duration-rows deletes rows that still have no valid replay or profiling latency after writeback.
Timing caveat: Average Duration(us) is the microbench value aggregated from op_summary_*.csv task duration. It is the runtime lookup's preferred latency, but comparison reports should still keep Profiling Average Duration(us) to expose replay versus whole-model profiling gaps.
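The signature match and the two update modes can be sketched as below. This is a simplified model, not start_microbench.py: `signature` and `write_back` are hypothetical names, counters and duplicate-signature reporting are omitted, and "invalid" is assumed to mean blank, non-numeric, or non-positive.

```python
from typing import Dict, List, Tuple

SIGNATURE_COLUMNS = (
    "Input Shapes",
    "Input Data Types",
    "Input Formats",
    "Output Shapes",
    "Output Data Types",
)
MICROBENCH = "Average Duration(us)"
PROFILING = "Profiling Average Duration(us)"


def signature(kernel_type: str, row: Dict[str, str]) -> Tuple[str, ...]:
    """Row-matching key; DispatchFFNCombine additionally includes EP Size."""
    sig = tuple(row.get(column, "") for column in SIGNATURE_COLUMNS)
    if kernel_type == "DispatchFFNCombine":
        sig += (row.get("EP Size", ""),)
    return sig


def _is_valid(row: Dict[str, str], column: str) -> bool:
    try:
        return float(row.get(column) or 0) > 0
    except ValueError:
        return False


def write_back(kernel_type: str, db_rows: List[Dict[str, str]],
               samples: List[Dict[str, str]], update_mode: str = "all") -> List[Dict[str, str]]:
    """Write microbench latency into db_rows in place; return the unmatched samples."""
    index = {signature(kernel_type, row): row for row in db_rows}
    unmatched = []
    for sample in samples:
        row = index.get(signature(kernel_type, sample))
        if row is None:
            unmatched.append(sample)  # always reported
            if update_mode == "all":
                db_rows.append(dict(sample))  # appended only in "all" mode
            continue
        if update_mode == "missing-only" and (
            _is_valid(row, MICROBENCH) or _is_valid(row, PROFILING)
        ):
            continue  # row already has a usable latency
        row[MICROBENCH] = sample[MICROBENCH]
    return unmatched
```

Returning the unmatched samples in both modes lets the caller list them in the update report even when they are not appended.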
2.1.11 HCCL Communication Benchmark
Communication collection is separate from compute replay because HCCL latency depends on rank-group size and topology.
tools/perf_data_collection/comm_bench/generate_comm_microbench.py supports:
- Operators: all-reduce, all-gather, reduce-scatter, all-to-all.
- --bytes-grid for the message-bytes grid.
- --num-devices.
- --topology-tier.
- --grid-shape for hardware topology.
- dtype selection.
- --bench-mode kernel or --bench-mode event.
- --do-run for direct execution under torchrun.
- --output-dir for per-op files, or --output-csv for a single file.
tools/perf_data_collection/comm_bench/run_comm_bench.sh is the batch wrapper. validate_comm_alignment.py checks whether measured rows align with the parser/query model under a configurable ratio tolerance.
Limitations:
- The query convention uses lowercase hcom_ prefixes, for example hcom_reduceScatter_.csv; CamelCase graph-compiled names are version-drift clues but not primary HCCL CSV filenames.
- Communication dtype is recorded in CSV, but the lookup path primarily matches message size, device count, and topology tier.
- If topology tier cannot be resolved from DeviceProfile, communication data must be organized to avoid ambiguity by directory and row contents.
2.1.12 Validation and Reports
Validation should cover three levels:
| Level | Checks | Representative command |
|---|---|---|
| Static/tooling | Python syntax, CLI help, schema unit tests, shape parser tests, HCCL benchmark pure-logic tests | pytest tests/tools/test_op_replay.py tests/tools/test_op_replay_common.py tests/tools/test_start_microbench.py tests/tools/test_generate_shape_grid.py tests/tools/test_fia_parser_backfill.py tests/tools/test_generate_comm_microbench.py tests/tools/test_validate_comm_alignment.py |
| Database/query | op_mapping.yaml schema, ProfilingDataSource lookup, interpolation, FIA enriched lookup | pytest tests/perf_database |
| Hardware-dependent | NPU replay, HCCL benchmark, writeback from real msprof output | Run pytest -m npu and selected start_microbench.py / comm_bench commands on an NPU host |
Each data refresh should produce:
- List of created or updated CSV files.
- Replay status summary from run_all_op_status.json.
- Profile update report from start_microbench.py.
- Duration-gap hotspot CSV when both replay and whole-model profiling latency exist.
- List of skipped operators or missing replay scripts.
- List of generated rows that still lack valid latency.
Validation must not hide partial coverage. Accepted miss, zero-cost ops, interpolation, partial composite hit, and FIA metadata completeness must remain visible in reports or CSV metadata.
2.1.13 Tool CLI Contract
| Tool | Required input | Main output | Key failure modes |
|---|---|---|---|
| parsers/parse_kernel_details.py | --profiling-path; output directory from --database-path or device/version parameters | Per-kernel {KernelType}.csv | Fails when input lacks required profiling columns |
| fill_fia_runtime_metadata.py | --csv-path, --jsonl-path | Enriched FusedInferAttentionScore.csv; overwrites in place by default, or uses --output-path | Fails when CSV/JSONL is missing, CSV header is empty, or JSONL is invalid |
| generate_shape_grid.py | --data-dir, or --device --vllm-version [--torch-version --cann-version] | Appends theory shape rows to CSVs | Fails when data dir is missing or contains no CSV; CSVs without a generator are skipped and reported |
| start_microbench.py | --database-path, or device/version parameters; optional --prof-path | Writes Average Duration(us) and MicroBench ... counters, generates reports/ | msprof missing, OPP env missing, no PROF_* / op_summary_*.csv; duplicate signatures, unmatched rows, and gaps enter the report |
| op_replay/run_all_op.py | Database path parameters; optional --op | Executes matching *_run.py scripts and writes run_all_op_status.json | Single-op failure; default is fail-fast, --continue-on-error runs remaining scripts |
| comm_bench/generate_comm_microbench.py | torchrun ... --do-run | hcom_*.csv | Without --do-run, no collection is executed; --output-csv allows only one --ops value |
| comm_bench/validate_comm_alignment.py | --csv-dir | PASS/WARN/FAIL alignment report; nonzero exit on FAIL | Fails when --csv-dir is not a directory; ratios beyond tolerance fail; malformed rows are not counted, so reviewers should check 0-row reports |
2.1.14 Human Review Checklist
Before a data refresh or tooling change is sent for human review, it should provide at least:
- Target software stack: device, vLLM-Ascend, PyTorch, CANN, op-plugin, and collection date.
- Raw input sources: profiling bundle name or path summary, whether operator_details.csv / trace_view.json exist, and whether FIA runtime JSONL exists.
- CSV schema changes: added, removed, or renamed columns; whether empty tensor slots are preserved; whether latency-column priority is still satisfied.
- op_mapping.yaml changes: added/modified mappings, evidence chains, alternate_kernel_types, tc_input_count, accepted_miss, zero_cost, and composite/decomposer impact.
- Shape-grid changes: --target-models, --rows, --seed, --max-hbm-gb, and whether generated-only rows have been replayed.
- Replay results: run_all_op_status.json summary, profile update report, duration-gap hotspot CSV, invalid rows, and duplicate signatures.
- HCCL results: hcom_*.csv file list, num_devices / topology_tier coverage, bench-mode, validate_comm_alignment.py tolerance, and result.
- Known limitations: missing replay scripts, incomplete FIA metadata, staging-only generated directories, and reasons hardware-dependent tests were not run.
2.1.15 Data Lifecycle and Version Management
Recommended lifecycle:
- Create or select the compute directory for a software stack.
- Import raw profiling with parsers/parse_kernel_details.py.
- Add or update op_mapping.yaml in the same software-stack directory.
- Enrich FIA metadata if runtime JSONL values are available.
- Generate shape rows in a staging copy or clearly named generated directory.
- First replay selected operators with start_microbench.py --update-mode missing-only.
- Review writeback reports, missing shapes, invalid rows, duplicate signatures, and duration gaps.
- Submit CSV/YAML changes together with changelog or reports, then promote the directory.
Versioning rules:
- Isolate software-stack changes by {version_dir}. Do not silently mix data from different vLLM-Ascend, PyTorch, or CANN versions.
- Store HCCL data under {device}/hccl/{cann_version} and reuse it through communication_data_ref.
- Update collection_date when a database directory receives a new profiling import or a large replay refresh.
- Keep published version directories unless there is an explicit deprecation notice.
- Use clear names for generated-only directories so users do not confuse them with complete measured data.
- Do not commit raw PROF_* directories unless review explicitly requires them; commit derived CSVs and report artifacts.
2.2 Alternative Solutions
Option A: Use Only Whole-model Raw Profiling
Import only whole-model profiling rows, without shape-grid generation or microbenchmark replay.
This is simple and closest to real workloads, but shape coverage is sparse. It cannot fill important neighboring shapes and makes runtime lookup overly dependent on one captured model/profile.
Option B: Generate All Replay Scripts from torch_npu_reference
Read op_mapping.yaml and automatically generate all op_replay/<KernelType>_run.py files from torch_npu_reference.{KernelType}.microbench_api.
This is a better long-term direction, but it is outside the first-phase scope. Many kernels need non-tensor parameters, custom OPP settings, preserved slots, valid cache tensors, or version-dependent API behavior that cannot be safely inferred from a single API string.
Option C: Use an External Database
Move CSV/YAML data to SQLite, DuckDB, or a service database.
This would improve indexing and schema validation, but it would add deployment and review cost in the first phase. The recommended solution keeps the runtime data source and review flow centered on file artifacts.
Option D: Replace Measurement with Analytic Models
Avoid replay collection and use analytic formulas for both compute and communication.
Analytic models are appropriate fallbacks, especially for communication, but they cannot accurately capture fusion, CANN kernel selection, tensor formats, and custom-op behavior.
2.3 Solution Analysis
Advantages
- Keeps data production offline, repeatable, and reviewable.
- Adopts the runtime contract: op_mapping.yaml plus per-kernel CSV files.
- Splits raw profiling import, metadata enrichment, shape expansion, replay measurement, and validation into clear tooling responsibilities.
- Allows partial data to be explicit instead of blocking an entire database refresh.
- Reuses HCCL measurements across compatible compute-stack directories.
- Supports exact measured rows and generated rows that can be filled incrementally.
Limitations
- Replay scripts require manual maintenance, and replay coverage expands incrementally.
- The first phase does not auto-consume torch_npu_reference to generate replay scripts.
- NPU-dependent replay and communication tests have limited regular CI coverage.
- CSV schema is easy to diff, but also easy to damage through manual edits.
- FIA replay can remain only an approximate reproduction when runtime sequence metadata is missing.
- Shape-grid generation may produce rows that look structurally valid but cannot be replayed by a specific operator API.
3. Implementation Plan
3.1 Rollout Plan
The schedule is phase-ordered. Phases 0-2 can run on any development host. Phases 3-4 depend on prepared NPU host windows. Phase 5 starts only after replay and communication reports are available.
| Milestone | Owner role and order | Scope | Exit criteria |
|---|---|---|---|
| Phase 0: RFC and contract | RFC owner; first phase before implementation work | Adopt this RFC as the standalone collection-tooling contract | Contributors can execute the flow without temporary documents |
| Phase 1: Parser/schema baseline | Tooling owner; no-NPU phase after Phase 0 | Validate raw parsing, output column order, and software-stack directory naming | Existing no-NPU parser and data-source tests pass |
| Phase 2: Shape-grid staging | Data collection owner; no-NPU staging phase after Phase 1 | Use --rows, --seed, and --max-hbm-gb in a staging directory | Generated rows preserve slot/dtype/format contracts and report skipped kernels |
| Phase 3: Replay writeback | Data collection owner and NPU host owner; scheduled when an NPU host is available | Run selected op_replay scripts through start_microbench.py on an NPU host | Average Duration(us) is filled, invalid rows are reported, and gap reports are generated |
| Phase 4: Communication collection | HCCL/NPU owner; scheduled with the same or adjacent NPU host window | Collect or refresh HCCL CSVs for the target CANN version | validate_comm_alignment.py passes under the agreed tolerance |
| Phase 5: Promotion | Data reviewer and runtime owner; after Phase 3-4 report review | Review CSV/YAML/report changes and promote the database directory | Runtime lookup tests pass and known limitations are recorded in the refresh report |
3.2 Test Plan
Run no-NPU checks before any hardware collection:
python -m pytest tests/tools/test_op_replay.py \
tests/tools/test_op_replay_common.py \
tests/tools/test_start_microbench.py \
tests/tools/test_generate_shape_grid.py \
tests/tools/test_fia_parser_backfill.py \
tests/tools/test_generate_comm_microbench.py \
tests/tools/test_validate_comm_alignment.py
python -m pytest tests/perf_database
Run syntax and help checks for changed replay scripts:
python -m py_compile tools/perf_data_collection/op_replay/<KernelType>_run.py
python tools/perf_data_collection/op_replay/<KernelType>_run.py --help
Run hardware-dependent checks only on a prepared NPU host:
python tools/perf_data_collection/start_microbench.py \
--database-path tensor_cast/performance_model/profiling_database/data/{device}/vllm_ascend/{version_dir} \
--op MatMulV2 RmsNorm \
--repeat-count 1 \
--update-mode missing-only
bash tools/perf_data_collection/comm_bench/run_comm_bench.sh \
tensor_cast/performance_model/profiling_database/data/{device}/hccl/{cann_version}
3.3 Follow-up Work
- Add an in-repo replay-script generator that consumes torch_npu_reference.{KernelType}.microbench_api and operator-specific metadata.
- Add stricter CSV schema validation and duplicate-signature checks before writeback.
- Expand NPU-marked integration tests to cover representative custom operators.
- Improve FIA runtime metadata capture to reduce replay rows that depend on heuristic inference.
- Add a single collection command that orchestrates parse, enrich, shape generation, replay, communication validation, and report packaging while preserving this file contract.
3.4 Completion Criteria
- tools/perf_data_collection/ entry points, database directories, CSV schemas, and the op_mapping.yaml contract match this RFC.
- At least one target software-stack directory can be generated through parse, optional enrich, shape generation, replay writeback, and report generation.
- The HCCL directory can be generated by communication benchmark tooling and referenced through communication_data_ref.
- No-NPU tests cover parser behavior, shape grid, FIA backfill, start_microbench writeback logic, communication CSV validation, and data-source query behavior.
- On an NPU host, selected replay and communication benchmark commands can run and produce status and gap reports for human review.