ascend-robotdocs: add performance database collection tooling RFC

RFC: Standalone Performance Database Collection Tooling

Metadata

Item	Content
Status	Complete
Author(s)	Horacehxw, Codex
Creation Date	2026-05-12
Related Links	https://gitcode.com/Ascend/msmodeling/pull/124

1. Overview

The performance database can only be trusted by runtime performance models when its CSV data, operator mappings, replay scripts, communication benchmarks, and validation reports come from a repeatable offline flow. This RFC defines tools/perf_data_collection/ as the offline data-production toolchain, covering raw profiling parsing, FIA runtime metadata backfill, theory shape-grid generation, per-kernel replay, msprof writeback, HCCL communication benchmarking, and basic trace conversion.

The produced database is consumed by EmpiricalPerformanceModel / ProfilingDataSource: runtime only needs to read op_mapping.yaml, per-kernel compute CSVs, and optional HCCL communication CSVs. This RFC fully defines how those data artifacts are generated, validated, and promoted into a reviewable performance database, without requiring any other design document to understand the database generation flow.

Concretely, the standalone collection and tooling boundary covers four questions: how raw NPU profiling results become versioned CSV data, how synthetic shapes and microbench measurements are added, how HCCL communication data is collected, and how data is validated before runtime query code consumes it.

1.1 Goals

Provide an offline data-production contract that does not depend on temporary design or planning documents.
Define the database directory structure, compute CSV schema, FIA-specific columns, communication CSV schema, and versioning rules.
Clarify the boundary between op_mapping.yaml, raw profiling parsing, shape-grid generation, operator replay scripts, msprof orchestration, and CSV writeback.
Record tooling boundaries and limitations so contributors know which steps are automated and which steps remain manually maintained.
Give human reviewers a shared checklist for deciding whether CSV/YAML/report changes can enter a production database directory.
Provide a rollout and test plan where most checks run without NPU hardware, while replay/HCCL checks remain isolated to hardware hosts.

1.2 Non-goals

This RFC does not change ProfilingDataSource runtime query behavior or interpolation logic.
This RFC does not define model-level alignment metrics or end-to-end simulation strategy; it only requires the produced data to be consumable by those systems.
This RFC does not require an external database service, remote artifact repository, or online collection pipeline.
This RFC does not claim replay coverage for every profiling kernel.
This RFC does not make temporary planning documents stable interfaces.

2. Solution Design

2.1 Recommended Solution

Define the standard offline collection pipeline as:

flowchart TD
    A[Raw NPU profiling bundle] --> B[parsers/parse_kernel_details.py]
    B --> C[Per-kernel compute CSVs]
    C --> D[fill_fia_runtime_metadata.py]
    C --> E[generate_shape_grid.py]
    D --> E
    E --> F[op_replay/*_run.py]
    F --> G[start_microbench.py under msprof]
    G --> H[CSV writeback and reports]
    I[HCCL benchmark] --> J[hccl/{cann_version}/hcom_*.csv]
    J --> H
    K[op_mapping.yaml] --> F
    K --> H
    H --> L[Versioned profiling_database/data directory]

The flow remains file-based. CSV and YAML are reviewable and diffable, and they remain compatible with the runtime data source. Tools may generate or rewrite rows, but persistent data artifacts stay in versioned database directories.

2.1.0 Tool Scope

This RFC defines the entry points and responsibilities under tools/perf_data_collection/:

Scope	Entry point	Database responsibility
Raw profiling parsing	`parsers/parse_kernel_details.py`	Generate per-kernel compute CSVs from `kernel_details*.csv` or a profiling directory
FIA runtime metadata backfill	`fill_fia_runtime_metadata.py`	Merge FIA JSONL runtime values into `FusedInferAttentionScore.csv`
Shape-grid expansion	`generate_shape_grid.py`, `grid_generator/`, `memory_estimator.py`	Append replayable theory shape rows and filter them by HBM budget
Compute operator replay	`op_replay/*_run.py`, `op_replay/run_all_op.py`, `start_microbench.py`	Replay CSV rows on an NPU host and write microbench latency back
HCCL communication collection	`comm_bench/generate_comm_microbench.py`, `comm_bench/run_comm_bench.sh`, `comm_bench/validate_comm_alignment.py`	Generate or validate `hccl/{cann_version}/hcom_*.csv`
Trace conversion support	`parsers/trace_to_csv.py`	Convert TensorCast Chrome traces to CSV for offline alignment and reports; it does not directly produce operator database CSVs

2.1.1 Database Directory Contract

The database has two persistent data classes:

Data type	Directory	Content	Version scope
Compute/operator data	`tensor_cast/performance_model/profiling_database/data/{device}/vllm_ascend/{version_dir}/`	`op_mapping.yaml` and `{KernelType}.csv`	Device + vLLM-Ascend + PyTorch + CANN stack
Communication data	`tensor_cast/performance_model/profiling_database/data/{device}/hccl/{cann_version}/`	`hcom_*.csv` benchmark files, with optional topology/config metadata	Device + CANN/HCCL stack

{version_dir} follows the helper convention:

vllm{vllm_version}_torch{torch_version}_cann{cann_version}

Recommended directory examples include:

vllm0.13.0_torch2.8.0_cann8.3
vllm0.15.0_torch2.9.0_cann8.5
vllm0.18.0_torch2.9.0_cann8.5
vllm0.18.0_torch2.9.0_cann8.5_shape_generated, a generated/staging example that should not be mistaken for a complete measured directory.
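
The naming convention above can be sketched as a pair of small helpers. This is a minimal illustration, not the toolchain's actual helper: the names `build_version_dir` and `parse_version_dir` are hypothetical, and the optional trailing suffix (such as `shape_generated`) is an assumption drawn from the staging example.

```python
import re
from typing import Dict, Optional


def build_version_dir(vllm_version: str, torch_version: str, cann_version: str) -> str:
    """Compose a compute directory name from the three stack versions."""
    return f"vllm{vllm_version}_torch{torch_version}_cann{cann_version}"


# Hypothetical inverse: split a directory name back into its components.
# An optional suffix after the CANN version marks staging/generated directories.
_VERSION_DIR_RE = re.compile(
    r"^vllm(?P<vllm>[0-9.]+)_torch(?P<torch>[0-9.]+)_cann(?P<cann>[0-9.]+)"
    r"(?:_(?P<suffix>.+))?$"
)


def parse_version_dir(name: str) -> Dict[str, Optional[str]]:
    """Return the vllm/torch/cann versions and optional suffix of a directory name."""
    match = _VERSION_DIR_RE.match(name)
    if match is None:
        raise ValueError(f"not a version directory name: {name!r}")
    return match.groupdict()
```

A reviewer could use the parsed `suffix` to flag staging directories before promotion.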

Each compute directory owns one op_mapping.yaml for that stack. Communication CSVs are not copied into every compute directory; op_mapping.yaml points to the communication directory through communication_data_ref, for example:

communication_data_ref: "../../hccl/v8.5/"
communication_fallback: analytic

This allows multiple vLLM-Ascend versions to reuse HCCL measurements when the device and CANN/HCCL stack are unchanged.
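
As a sketch of how the reference might be resolved, the snippet below treats `communication_data_ref` as relative to the compute directory that owns `op_mapping.yaml`. That resolution rule is an assumption inferred from the `../../hccl/` example; the function name `resolve_comm_dir` and the device name `my_device` are hypothetical.

```python
import posixpath


def resolve_comm_dir(compute_dir: str, communication_data_ref: str) -> str:
    """Resolve communication_data_ref relative to the compute directory.

    posixpath is used because the reference is written with forward slashes.
    """
    return posixpath.normpath(posixpath.join(compute_dir, communication_data_ref))


compute_dir = (
    "tensor_cast/performance_model/profiling_database/data/"
    "my_device/vllm_ascend/vllm0.18.0_torch2.9.0_cann8.5"
)
comm_dir = resolve_comm_dir(compute_dir, "../../hccl/v8.5/")
```

With this rule, two `{version_dir}` siblings that carry the same reference resolve to the same HCCL directory.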

2.1.2 Compute CSV Contract

Each compute CSV is named from the NPU profiling Type column after necessary normalization:

{KernelType}.csv

The parser and runtime data source use a CSV contract based on shapes and metadata rather than a Python object schema. The base required columns are:

Column group	Columns	Producer
Kernel identity and shape	`OP State`, `Accelerator Core`, `Input Shapes`, `Input Data Types`, `Input Formats`, `Output Shapes`, `Output Data Types`, `Output Formats`	Raw profiling parser
Profiling latency	`Profiling Average Duration(us)`, `Profiling Median Duration(us)`, `Profiling Std Duration(us)`	Raw profiling parser
Profiling counters	`Profiling Average aicore_time(us)`, `Profiling Average aic_total_cycles`, `Profiling Average aic_mac_time(us)`, and related AIC/AIV utilization columns	Raw profiling parser
Microbench latency	`Average Duration(us)`	`start_microbench.py` writeback
Microbench counters	`MicroBench aicore_time(us)`, `MicroBench aic_total_cycles`, and corresponding `MicroBench ...` counter columns	`start_microbench.py` writeback

Shape fields use semicolons to separate tensor slots and commas to separate dimensions:

"136,7168;7168,3584"

Key rules:

Slot count has semantics. Empty slots must be preserved for FIA, custom kernels, and operators with optional inputs or scalar parameters.
Input Data Types and Input Formats align with Input Shapes by slot index.
FRACTAL_NZ is a valid format; replay and runtime lookup normalize it when needed.
Generated rows may initially have no latency or a zero latency. A row should be considered measured only after replay/writeback fills Average Duration(us) or profiling import fills Profiling Average Duration(us).
Runtime lookup selects latency columns in this order: Average Duration(us), Profiling Average Duration(us), then Duration(us).
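
The shape encoding and the latency-column priority can be illustrated with a short sketch. It is not the runtime implementation: the helper names are hypothetical, an empty slot is assumed to appear as an empty string between semicolons, and zero or non-numeric latencies are assumed to mean "not measured".

```python
import math
from typing import List, Optional, Tuple

LATENCY_COLUMN_PRIORITY = (
    "Average Duration(us)",
    "Profiling Average Duration(us)",
    "Duration(us)",
)


def parse_shapes(field: str) -> List[Optional[Tuple[int, ...]]]:
    """Split a shape field into per-slot dimension tuples.

    Empty slots are kept as None so that slot indices stay aligned
    with the dtype and format fields.
    """
    slots: List[Optional[Tuple[int, ...]]] = []
    for slot in field.strip().strip('"').split(";"):
        slot = slot.strip()
        slots.append(tuple(int(dim) for dim in slot.split(",")) if slot else None)
    return slots


def select_latency(row: dict) -> Optional[float]:
    """Return the first usable latency in the documented column order."""
    for column in LATENCY_COLUMN_PRIORITY:
        raw = str(row.get(column) or "").strip()
        if not raw:
            continue
        try:
            value = float(raw)
        except ValueError:
            continue
        # Assumption: zero or non-finite values mark an unmeasured row.
        if math.isfinite(value) and value > 0:
            return value
    return None
```

For example, `parse_shapes("136,7168;7168,3584")` yields two slots, and a row whose `Average Duration(us)` is still zero falls through to `Profiling Average Duration(us)`.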

2.1.3 FIA CSV Extension Contract

FusedInferAttentionScore.csv needs additional runtime metadata because shape alone cannot fully describe paged attention behavior. The parser and backfill tool use the following columns when available:

Column	Meaning
`Runtime source_profile`	Source profiling subdirectory or source tag
`Runtime actual_seq_lengths_shape` / `Runtime actual_seq_lengths_values`	Query sequence-length metadata
`Runtime actual_seq_lengths_kv_shape` / `Runtime actual_seq_lengths_kv_values`	KV sequence-length metadata
`Runtime avg_seq_len`	Average KV sequence length used by attention lookup and interpolation
`Runtime block_table_shape` / `Runtime block_table_valid_blocks`	Paged cache block-table metadata
`Runtime num_heads` / `Runtime num_key_value_heads`	Runtime head configuration
`Runtime sparse_mode`, `Runtime input_layout`, `Runtime block_size`	FIA execution-mode fields
`Runtime attn_state`, `Runtime kv_cache_mode`	Optional state/cache-mode fields
`Runtime metadata_completeness`	Completeness tag, for example `profile_shapes_only`, `runtime_values`, `runtime_values_dumped`, or `shape_grid_scene_generated`

The parser can recover FIA shape-level metadata from profiling bundles. Full runtime values require a JSONL dump plus fill_fia_runtime_metadata.py. When those values are missing, the replay script falls back to heuristic inference; such rows remain useful, but they should not be treated as fully equivalent reproductions of whole-model runtime state.

2.1.4 Communication CSV Contract

Communication data is stored under the HCCL directory, one CSV per HCCL kernel. Standard filenames are:

hcom_allGather_.csv
hcom_allReduce_.csv
hcom_alltoallv_.csv
hcom_reduceScatter_.csv

Communication CSV columns are split between runtime-required columns and collection-standard columns. ProfilingDataSource runtime lookup requires message_bytes, num_devices, and one usable latency column. topology_tier participates in matching only when the CSV contains it and DeviceProfile.comm_grid can resolve it. dtype and bandwidth_gbps are collection-standard columns for audit and validation; lookup does not filter by dtype. communication_fallback records the intended fallback policy; actual MISS fallback is handled by the upper empirical model.

Column	Level	Type	Meaning
`message_bytes`	Runtime-required	int	Per-rank message size used for lookup and interpolation
`num_devices`	Runtime-required	int	Number of ranks in the communication group
`Average Duration(us)` / `Profiling Average Duration(us)` / `Duration(us)`	One runtime-required latency column	float	Communication latency; runtime selects by the common latency-column priority
`topology_tier`	Conditional match	int	Device topology tier; used when topology can be resolved
`dtype`	Collection-standard	string	Measured dtype, such as `DT_BF16` or `INT8`
`bandwidth_gbps`	Collection-standard	float	Derived bandwidth for audit and validation

ProfilingDataSource first searches for a communication row that exactly matches message_bytes and num_devices; when topology_tier can be resolved and the CSV contains that column, topology_tier must also match. On miss, it finds bracket rows within the same num_devices and available topology tier, fits an alpha-beta latency model from those matching rows, and clamps the predicted latency to the bracket latency bounds. If DeviceProfile.comm_grid is absent, topology filtering is skipped; in that case, the CSV should avoid ambiguous rows for the same message/device count.
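
The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the `ProfilingDataSource` code: rows are assumed to be pre-parsed dicts with a hypothetical `latency_us` key, and the alpha-beta model `latency = alpha + beta * message_bytes` is fitted by least squares over all rows that share the same `num_devices` and topology tier.

```python
from typing import List, Optional


def lookup_comm_latency(
    rows: List[dict],
    message_bytes: int,
    num_devices: int,
    topology_tier: Optional[int] = None,
) -> Optional[float]:
    """Exact match first, then a clamped alpha-beta prediction, else None (MISS)."""
    candidates = [
        row for row in rows
        if row["num_devices"] == num_devices
        and (
            topology_tier is None           # topology could not be resolved
            or "topology_tier" not in row   # CSV has no topology column
            or row["topology_tier"] == topology_tier
        )
    ]
    for row in candidates:
        if row["message_bytes"] == message_bytes:
            return row["latency_us"]

    lower = [row for row in candidates if row["message_bytes"] < message_bytes]
    upper = [row for row in candidates if row["message_bytes"] > message_bytes]
    if not lower or not upper:
        return None  # no bracket: the caller decides on the fallback
    below = max(lower, key=lambda row: row["message_bytes"])
    above = min(upper, key=lambda row: row["message_bytes"])

    # Least-squares fit of latency = alpha + beta * message_bytes.
    count = len(candidates)
    mean_x = sum(row["message_bytes"] for row in candidates) / count
    mean_y = sum(row["latency_us"] for row in candidates) / count
    var_x = sum((row["message_bytes"] - mean_x) ** 2 for row in candidates)
    cov_xy = sum(
        (row["message_bytes"] - mean_x) * (row["latency_us"] - mean_y)
        for row in candidates
    )
    beta = cov_xy / var_x
    alpha = mean_y - beta * mean_x
    predicted = alpha + beta * message_bytes

    # Clamp the prediction to the latencies of the two bracket rows.
    low_bound, high_bound = sorted((below["latency_us"], above["latency_us"]))
    return min(max(predicted, low_bound), high_bound)
```

Clamping keeps a poorly fitted line from producing a latency outside what was measured on either side of the query.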

2.1.5 Relationship Between `op_mapping.yaml` and Tooling

op_mapping.yaml is both the runtime query contract and a tooling input. It maps TensorCast operators to NPU profiling kernel types and records the evidence chain for each mapping.

Important top-level fields:

Field	Purpose
`version`, `device`, `cann_version`, `pytorch_version`, `op_plugin_version`, `collection_date`	Identify the software stack and collection batch
`communication_data_ref`	Relative path to HCCL communication CSV data
`communication_fallback`	Intended fallback policy; actual MISS fallback is handled by the upper empirical model
`interpolation_policy`	Optional interpolation behavior, such as sqrt transform for FIA
`operator_mappings`	Runtime mapping from TensorCast op name to kernel query rule
`torch_npu_reference`	Reference API metadata that can help create replay scripts

Important operator_mappings fields:

Field	Meaning
`kernel_type`	Primary `{KernelType}.csv` file stem
`alternate_kernel_types`	Candidate CSVs for version drift or kernel variants
`category: communication`	Route the query to communication lookup
`query_mode`	Use special query logic, such as `attention_special`, `elementwise`, or `moe_fused`
`composite`, `sub_kernels`, `decomposer`	Model one TensorCast op as multiple NPU kernels
`tc_input_count`	Truncate TensorCast inputs for matching when TC and NPU signatures differ
`zero_cost`	Mark shape-only or fusion-absorbed ops as measured zero latency
`accepted_miss`	Record expected misses so they do not block validation
`notes`	Evidence chain and review context

The first phase does not require automatically consuming torch_npu_reference.{KernelType}.microbench_api to generate op_replay/<KernelType>_run.py. Replay scripts are maintained manually. When adding or updating a replay script, contributors should use the Microbench Run Script Generator workflow in docs/perf_database/skills/microbench/SKILL.md: generate or update the script from the operator CSV, op_mapping.yaml microbench_api, upstream interface docs/tests, and the local replay conventions. A future generator can build on this workflow, but this RFC treats torch_npu_reference as auxiliary metadata rather than first-phase automation.

2.1.6 Raw Profiling Parser

tools/perf_data_collection/parsers/parse_kernel_details.py is the entry point for converting Ascend profiling output into per-kernel CSV files.

Parser behavior:

--profiling-path may be a single kernel_details*.csv file or a profiling directory.
Directory input is scanned recursively for CSV filenames containing kernel_details.
operator_details.csv and trace_view.json are discovered for FIA bundle inspection and later enrichment.
Kernel details must contain Type, OP State, Accelerator Core, shape/dtype/format columns, Duration(us), and AIC/AIV counter columns.
Known variant names are normalized, including split_qkv_rmsnorm_rope_kernel_0 to split_qkv_rmsnorm_rope_kernel, and muls_add_kernel_1 to muls_add_kernel.
Rows are aggregated by (normalized Type, Input Shapes, Output Shapes).
Output includes average, median, and standard-deviation latency columns, plus averaged hardware counters.
Each operator type is written as {KernelType}.csv under the target database directory.

The parser is conservative: it does not infer missing tensor slots and does not rewrite API semantics. It preserves the profiling representation and leaves shape/API alignment to op_mapping.yaml, generate_shape_grid.py, and replay scripts.
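
The normalization and aggregation steps can be sketched as below. This is a simplified illustration rather than the parser itself: it covers only the two variant names listed above, aggregates only `Duration(us)`, and assumes a population standard deviation, which the real parser may not use.

```python
import statistics
from collections import defaultdict
from typing import Dict, List, Tuple

# Only the variants named in this RFC; the real table may be longer.
KNOWN_VARIANTS = {
    "split_qkv_rmsnorm_rope_kernel_0": "split_qkv_rmsnorm_rope_kernel",
    "muls_add_kernel_1": "muls_add_kernel",
}


def aggregate_kernel_rows(rows: List[dict]) -> Dict[Tuple[str, str, str], dict]:
    """Group rows by (normalized Type, Input Shapes, Output Shapes)."""
    groups: Dict[Tuple[str, str, str], List[float]] = defaultdict(list)
    for row in rows:
        kernel_type = KNOWN_VARIANTS.get(row["Type"], row["Type"])
        key = (kernel_type, row["Input Shapes"], row["Output Shapes"])
        groups[key].append(float(row["Duration(us)"]))

    aggregated = {}
    for key, durations in groups.items():
        aggregated[key] = {
            "Profiling Average Duration(us)": statistics.fmean(durations),
            "Profiling Median Duration(us)": statistics.median(durations),
            # Assumption: population standard deviation.
            "Profiling Std Duration(us)": statistics.pstdev(durations),
        }
    return aggregated
```

An explicit variant table, rather than a generic "strip trailing digits" rule, avoids collapsing distinct kernel types whose names legitimately end in a number.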

2.1.7 FIA Runtime Metadata Enrichment

FIA enrichment has two layers:

parsers/parse_kernel_details.py can attach shape-level metadata from a profiling bundle and mark rows as Runtime metadata_completeness=profile_shapes_only.
tools/perf_data_collection/fill_fia_runtime_metadata.py can merge runtime JSONL dumps into FusedInferAttentionScore.csv.

JSONL backfill uses LEFT JOIN semantics, so every CSV row is kept whether or not a runtime record matches it:

CSV key: query/key/value shapes, KV sequence item count, attention mask shape, and block-table shape.
JSONL key: corresponding runtime tensor metadata.
If multiple runtime records match one CSV row, that row is expanded into one output row per matching record (1:N).
Matched rows receive a configurable metadata tag; the default is runtime_values_dumped.

The minimum JSONL join fields are query_shape, key_shape, value_shape, actual_seq_lengths_kv, atten_mask_shape, and block_table_shape. Optional payload fields include actual_seq_lengths, block_table_valid_blocks, num_heads, num_key_value_heads, sparse_mode, input_layout, and block_size. Each line is one JSON object, for example:

{"query_shape":[8192,16,128],"key_shape":[8192,16,128],"value_shape":[8192,16,128],"actual_seq_lengths_kv":[40,998],"atten_mask_shape":null,"block_table_shape":null,"num_heads":16,"num_key_value_heads":2,"sparse_mode":0,"input_layout":"TND","block_size":128}

This design makes metadata incompleteness explicit. Shape-only FIA rows remain usable, but validation reports should treat profile_shapes_only rows as lower confidence than rows with real runtime sequence values.
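
The join behavior can be sketched as follows. This is a minimal illustration, not `fill_fia_runtime_metadata.py`: each CSV row is assumed to carry a precomputed, hypothetical `join_key` tuple, since the way the key is derived from the CSV columns is not reproduced here, and only two payload columns are filled.

```python
import json
from collections import defaultdict
from typing import Iterable, List


def record_key(record: dict) -> tuple:
    """Build the join key from the minimum JSONL join fields."""
    def as_tuple(shape):
        return None if shape is None else tuple(shape)

    return (
        as_tuple(record["query_shape"]),
        as_tuple(record["key_shape"]),
        as_tuple(record["value_shape"]),
        len(record["actual_seq_lengths_kv"]),  # KV sequence item count
        as_tuple(record["atten_mask_shape"]),
        as_tuple(record["block_table_shape"]),
    )


def backfill_fia_rows(
    csv_rows: List[dict],
    jsonl_lines: Iterable[str],
    tag: str = "runtime_values_dumped",
) -> List[dict]:
    """LEFT JOIN CSV rows with runtime records, expanding 1:N on multiple matches."""
    index = defaultdict(list)
    for line in jsonl_lines:
        if line.strip():
            record = json.loads(line)
            index[record_key(record)].append(record)

    output = []
    for row in csv_rows:
        matches = index.get(row["join_key"], [])
        if not matches:
            output.append(dict(row))  # unmatched rows are kept unchanged
            continue
        for record in matches:
            enriched = dict(row)
            enriched["Runtime actual_seq_lengths_kv_values"] = record["actual_seq_lengths_kv"]
            enriched["Runtime num_heads"] = record.get("num_heads")
            enriched["Runtime metadata_completeness"] = tag
            output.append(enriched)
    return output
```

Unmatched rows keep their existing completeness tag, which is what lets reports separate shape-only rows from rows with dumped runtime values.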

2.1.8 Shape Grid Generation

tools/perf_data_collection/generate_shape_grid.py appends theory-generated rows to database CSV files. It does not replace real profiling data and does not produce measured latency by itself.

Shape-grid behavior:

grid_generator/config.yaml routes kernel types to theory patterns.
Template patterns cover GEMM, quantized matmul, elementwise, norm, quantization, RoPE, sampling, KV cache, and shape-manipulation categories.
Fused attention, grouped matmul, and DispatchFFNCombine use more complex Python generators.
--target-models uses known model configurations to prune GEMM (N, K) candidates.
--rows caps rows per CSV; --seed makes sampling reproducible.
--max-hbm-gb filters generated rows whose estimated input/output tensors exceed the memory budget.
Files without a matching theory generator are skipped.

Generated rows should inherit stable structural metadata from source rows and leave performance values blank or zero. They become production data only after start_microbench.py or a later profiling import fills latency.
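
The `--max-hbm-gb` filter can be pictured with a rough estimator. This sketch is not `memory_estimator.py`: the dtype-size table is hypothetical, dtype fields are assumed to be semicolon-separated like shapes, and one GB is taken as 1024^3 bytes.

```python
# Hypothetical dtype sizes in bytes; the real estimator may use other names.
DTYPE_BYTES = {"FLOAT": 4, "FLOAT16": 2, "DT_BF16": 2, "INT8": 1, "INT32": 4}


def estimate_bytes(shapes_field: str, dtypes_field: str) -> int:
    """Sum element_count * dtype_size over all non-empty tensor slots."""
    total = 0
    for shape, dtype in zip(shapes_field.split(";"), dtypes_field.split(";")):
        if not shape.strip():
            continue  # empty slot: no tensor to allocate
        element_count = 1
        for dim in shape.split(","):
            element_count *= int(dim)
        total += element_count * DTYPE_BYTES[dtype.strip()]
    return total


def fits_hbm_budget(row: dict, max_hbm_gb: float) -> bool:
    """Return True when estimated input plus output tensors fit the budget."""
    total = estimate_bytes(row["Input Shapes"], row["Input Data Types"])
    total += estimate_bytes(row["Output Shapes"], row["Output Data Types"])
    return total <= max_hbm_gb * 1024 ** 3
```

The estimate covers only input and output tensors, as stated above, so it is a lower bound: workspace and intermediate buffers are not counted.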

2.1.9 Per-kernel `op_replay` Script Framework

tools/perf_data_collection/op_replay/ stores per-kernel replay implementations. Each covered kernel should provide a matching *_run.py script for categories such as matmul, quantization, FIA, KV cache, softmax, sort, transpose, and vLLM-Ascend custom kernels.

Shared conventions:

common.py handles database path resolution, software-stack version directory naming, dtype/format parsing, tensor construction, repeat-count handling, and invalid-row tracking.
replay_framework.py provides an OpReplay helper for standard API-style kernels.
run_all_op.py discovers *_run.py, supports --execution-mode inprocess for one outer msprof session, and writes run_all_op_status.json.
Each operator script reads the matching {KernelType}.csv, replays each row on NPU, prints concise [OK] messages, and deletes invalid rows when replay case construction fails.
When --repeat-count is omitted, MSMODELING_OP_REPLAY_REPEAT_COUNT can provide the default repeat count.
When adding or modifying <KernelType>_run.py, contributors should follow the Microbench Run Script Generator skill's core steps: read the target CSV, locate torch_npu_reference.<KernelType>.microbench_api, confirm the real API from upstream repo docs/tests, infer missing non-tensor arguments, reuse common.py / replay_framework.py conventions, and validate at least py_compile plus --help.

Coverage limitation: replay coverage can expand incrementally. A CSV may exist without a corresponding replay script, and some scripts only support their target recorded shape/API patterns. Custom operators also require correct ASCEND_CUSTOM_OPP_PATH and LD_LIBRARY_PATH. Regular CI coverage for NPU-dependent replay is necessarily limited; most tests cover imports, CLI help, pure parsing logic, and writeback units.

2.1.10 `start_microbench.py` `msprof` Orchestration and Writeback

tools/perf_data_collection/start_microbench.py is the production entry point for replay measurement writeback.

Flow:

Resolve the target database directory through --database-path or device/version parameters.
Validate selected operators and required custom OPP environment variables.
Run msprof --output=... python op_replay/run_all_op.py --execution-mode inprocess.
Locate generated PROF_* directories and op_summary_*.csv files.
Aggregate rows by operator type and signature.
Write Average Duration(us) and MicroBench ... counter columns back to matching database CSV rows.
Generate a markdown update report and duration-gap CSV under reports/.

Row matching uses:

Input Shapes
Input Data Types
Input Formats
Output Shapes
Output Data Types

DispatchFFNCombine also includes EP Size in the signature.

Update modes:

Mode	Behavior
`all`	Update all matching rows and append unmatched profiling samples
`missing-only`	Replay/fill only rows whose microbench and profiling latencies are invalid; unmatched samples are reported but not appended

--prof-path can parse an existing PROF_* directory without launching msprof. --prune-empty-duration-rows deletes rows that still have no valid replay or profiling latency after writeback.

Timing caveat: Average Duration(us) is the microbench value aggregated from op_summary_*.csv task duration. It is the runtime lookup's preferred latency, but comparison reports should still keep Profiling Average Duration(us) to expose replay versus whole-model profiling gaps.
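
Signature matching and the two update modes can be sketched together. This is an illustration of the rules in this section, not the `start_microbench.py` implementation: the function names are hypothetical, samples are assumed to be already aggregated dicts, counter columns are omitted, and duplicate signatures are resolved by letting the last row win.

```python
from typing import List

SIGNATURE_COLUMNS = (
    "Input Shapes",
    "Input Data Types",
    "Input Formats",
    "Output Shapes",
    "Output Data Types",
)
LATENCY_COLUMNS = ("Average Duration(us)", "Profiling Average Duration(us)")


def row_signature(row: dict, kernel_type: str) -> tuple:
    """Build the matching key; DispatchFFNCombine adds EP Size."""
    columns = SIGNATURE_COLUMNS
    if kernel_type == "DispatchFFNCombine":
        columns = columns + ("EP Size",)
    return tuple(str(row.get(column, "")).strip() for column in columns)


def has_valid_latency(row: dict) -> bool:
    """True when a microbench or profiling latency is a positive number."""
    for column in LATENCY_COLUMNS:
        try:
            if float(row.get(column) or 0) > 0:
                return True
        except ValueError:
            continue
    return False


def write_back(db_rows: List[dict], samples: List[dict], kernel_type: str,
               update_mode: str = "all") -> List[dict]:
    """Write sample latency into matching rows and return unmatched samples."""
    index = {row_signature(row, kernel_type): row for row in db_rows}
    unmatched = []
    for sample in samples:
        target = index.get(row_signature(sample, kernel_type))
        if target is None:
            unmatched.append(sample)
            if update_mode == "all":
                db_rows.append(dict(sample))  # append unmatched samples
            continue
        if update_mode == "missing-only" and has_valid_latency(target):
            continue  # keep rows that already have a usable latency
        target["Average Duration(us)"] = sample["Average Duration(us)"]
    return unmatched
```

In both modes the returned list is what a report would show as unmatched samples; only `all` mode also appends them to the database rows.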

2.1.11 HCCL Communication Benchmark

Communication collection is separate from compute replay because HCCL latency depends on rank-group size and topology.

tools/perf_data_collection/comm_bench/generate_comm_microbench.py supports:

Operators: all-reduce, all-gather, reduce-scatter, all-to-all.
--bytes-grid for the message-bytes grid.
--num-devices.
--topology-tier.
--grid-shape for hardware topology.
dtype selection.
--bench-mode kernel or --bench-mode event.
--do-run for direct execution under torchrun.
--output-dir for per-op files, or --output-csv for a single file.

tools/perf_data_collection/comm_bench/run_comm_bench.sh is the batch wrapper. validate_comm_alignment.py checks whether measured rows align with the parser/query model under a configurable ratio tolerance.

Limitations:

The query convention uses lowercase hcom_ prefixes, for example hcom_reduceScatter_.csv; CamelCase graph-compiled names are version-drift clues but not primary HCCL CSV filenames.
Communication dtype is recorded in CSV, but the lookup path does not filter by dtype; it matches message size, device count, and, when it can be resolved, topology tier.
If topology tier cannot be resolved from DeviceProfile, communication data must be organized to avoid ambiguity by directory and row contents.

2.1.12 Validation and Reports

Validation should cover three levels:

Level	Checks	Representative command
Static/tooling	Python syntax, CLI help, schema unit tests, shape parser tests, HCCL benchmark pure-logic tests	`pytest tests/tools/test_op_replay.py tests/tools/test_op_replay_common.py tests/tools/test_start_microbench.py tests/tools/test_generate_shape_grid.py tests/tools/test_fia_parser_backfill.py tests/tools/test_generate_comm_microbench.py tests/tools/test_validate_comm_alignment.py`
Database/query	`op_mapping.yaml` schema, `ProfilingDataSource` lookup, interpolation, FIA enriched lookup	`pytest tests/perf_database`
Hardware-dependent	NPU replay, HCCL benchmark, writeback from real `msprof` output	Run `pytest -m npu` and selected `start_microbench.py` / `comm_bench` commands on an NPU host

Each data refresh should produce:

List of created or updated CSV files.
Replay status summary from run_all_op_status.json.
Profile update report from start_microbench.py.
Duration-gap hotspot CSV when both replay and whole-model profiling latency exist.
List of skipped operators or missing replay scripts.
List of generated rows that still lack valid latency.

Validation must not hide partial coverage. Accepted miss, zero-cost ops, interpolation, partial composite hit, and FIA metadata completeness must remain visible in reports or CSV metadata.

2.1.13 Tool CLI Contract

Tool	Required input	Main output	Key failure modes
`parsers/parse_kernel_details.py`	`--profiling-path`; output directory from `--database-path` or device/version parameters	Per-kernel `{KernelType}.csv`	Fails when input lacks required profiling columns
`fill_fia_runtime_metadata.py`	`--csv-path`, `--jsonl-path`	Enriched `FusedInferAttentionScore.csv`; overwrites in place by default, or uses `--output-path`	Fails when CSV/JSONL is missing, CSV header is empty, or JSONL is invalid
`generate_shape_grid.py`	`--data-dir`, or `--device --vllm-version [--torch-version --cann-version]`	Appends theory shape rows to CSVs	Fails when data dir is missing or contains no CSV; CSVs without a generator are skipped and reported
`start_microbench.py`	`--database-path`, or device/version parameters; optional `--prof-path`	Writes `Average Duration(us)` and `MicroBench ...` counters, generates `reports/`	Fails when `msprof` is missing, OPP environment variables are missing, or no `PROF_*` / `op_summary_*.csv` output is found; duplicate signatures, unmatched rows, and gaps are recorded in the report
`op_replay/run_all_op.py`	Database path parameters; optional `--op`	Executes matching `*_run.py` scripts and writes `run_all_op_status.json`	Single-op failure; default is fail-fast, `--continue-on-error` runs remaining scripts
`comm_bench/generate_comm_microbench.py`	`torchrun ... --do-run`	`hcom_*.csv`	Without `--do-run`, no collection is executed; `--output-csv` allows only one `--ops` value
`comm_bench/validate_comm_alignment.py`	`--csv-dir`	PASS/WARN/FAIL alignment report; nonzero exit on FAIL	Fails when `--csv-dir` is not a directory; ratios beyond tolerance fail; malformed rows are not counted, so reviewers should check 0-row reports

2.1.14 Human Review Checklist

Before a data refresh or tooling change is sent for human review, it should provide at least:

Target software stack: device, vLLM-Ascend, PyTorch, CANN, op-plugin, and collection date.
Raw input sources: profiling bundle name or path summary, whether operator_details.csv / trace_view.json exist, and whether FIA runtime JSONL exists.
CSV schema changes: added, removed, or renamed columns; whether empty tensor slots are preserved; whether latency-column priority is still satisfied.
op_mapping.yaml changes: added/modified mappings, evidence chains, alternate_kernel_types, tc_input_count, accepted_miss, zero_cost, and composite/decomposer impact.
Shape-grid changes: --target-models, --rows, --seed, --max-hbm-gb, and whether generated-only rows have been replayed.
Replay results: run_all_op_status.json summary, profile update report, duration-gap hotspot CSV, invalid rows, and duplicate signatures.
HCCL results: hcom_*.csv file list, num_devices / topology_tier coverage, bench-mode, validate_comm_alignment.py tolerance, and result.
Known limitations: missing replay scripts, incomplete FIA metadata, staging-only generated directories, and reasons hardware-dependent tests were not run.

2.1.15 Data Lifecycle and Version Management

Recommended lifecycle:

Create or select the compute directory for a software stack.
Import raw profiling with parsers/parse_kernel_details.py.
Add or update op_mapping.yaml in the same software-stack directory.
Enrich FIA metadata if runtime JSONL values are available.
Generate shape rows in a staging copy or clearly named generated directory.
First replay selected operators with start_microbench.py --update-mode missing-only.
Review writeback reports, missing shapes, invalid rows, duplicate signatures, and duration gaps.
Submit CSV/YAML changes together with changelog or reports, then promote the directory.

Versioning rules:

Isolate software-stack changes by {version_dir}. Do not silently mix data from different vLLM-Ascend, PyTorch, or CANN versions.
Store HCCL data under {device}/hccl/{cann_version} and reuse it through communication_data_ref.
Update collection_date when a database directory receives a new profiling import or a large replay refresh.
Keep published version directories unless there is an explicit deprecation notice.
Use clear names for generated-only directories so users do not confuse them with complete measured data.
Do not commit raw PROF_* directories unless review explicitly requires them; commit derived CSVs and report artifacts.

2.2 Alternative Solutions

Option A: Use Only Whole-model Raw Profiling

Import only whole-model profiling rows, without shape-grid generation or microbenchmark replay.

This is simple and closest to real workloads, but shape coverage is sparse. It cannot fill important neighboring shapes and makes runtime lookup overly dependent on one captured model/profile.

Option B: Generate All Replay Scripts from `torch_npu_reference`

Read op_mapping.yaml and automatically generate all op_replay/<KernelType>_run.py files from torch_npu_reference.{KernelType}.microbench_api.

This is a better long-term direction, but it is outside the first-phase scope. Many kernels need non-tensor parameters, custom OPP settings, preserved slots, valid cache tensors, or version-dependent API behavior that cannot be safely inferred from a single API string.

Option C: Use an External Database

Move CSV/YAML data to SQLite, DuckDB, or a service database.

This would improve indexing and schema validation, but it would add deployment and review cost in the first phase. The recommended solution keeps the runtime data source and review flow centered on file artifacts.

Option D: Replace Measurement with Analytic Models

Avoid replay collection and use analytic formulas for both compute and communication.

Analytic models are appropriate fallbacks, especially for communication, but they cannot accurately capture fusion, CANN kernel selection, tensor formats, and custom-op behavior.

2.3 Solution Analysis

Advantages

Keeps data production offline, repeatable, and reviewable.
Adopts the runtime contract: op_mapping.yaml plus per-kernel CSV files.
Splits raw profiling import, metadata enrichment, shape expansion, replay measurement, and validation into clear tooling responsibilities.
Allows partial data to be explicit instead of blocking an entire database refresh.
Reuses HCCL measurements across compatible compute-stack directories.
Supports exact measured rows and generated rows that can be filled incrementally.

Limitations

Replay scripts require manual maintenance, and replay coverage expands incrementally.
The first phase does not auto-consume torch_npu_reference to generate replay scripts.
NPU-dependent replay and communication tests have limited regular CI coverage.
CSV schema is easy to diff, but also easy to damage through manual edits.
FIA replay can remain only an approximate reproduction when runtime sequence metadata is missing.
Shape-grid generation may produce rows that look structurally valid but cannot be replayed by a specific operator API.

3. Implementation Plan

3.1 Rollout Plan

The schedule is phase-ordered. Phases 0-2 can run on any development host. Phases 3-4 depend on prepared NPU host windows. Phase 5 starts only after replay and communication reports are available.

Milestone	Owner role and order	Scope	Exit criteria
Phase 0: RFC and contract	RFC owner; first phase before implementation work	Adopt this RFC as the standalone collection-tooling contract	Contributors can execute the flow without temporary documents
Phase 1: Parser/schema baseline	Tooling owner; no-NPU phase after Phase 0	Validate raw parsing, output column order, and software-stack directory naming	Existing no-NPU parser and data-source tests pass
Phase 2: Shape-grid staging	Data collection owner; no-NPU staging phase after Phase 1	Use `--rows`, `--seed`, and `--max-hbm-gb` in a staging directory	Generated rows preserve slot/dtype/format contracts and report skipped kernels
Phase 3: Replay writeback	Data collection owner and NPU host owner; scheduled when an NPU host is available	Run selected `op_replay` scripts through `start_microbench.py` on an NPU host	`Average Duration(us)` is filled, invalid rows are reported, and gap reports are generated
Phase 4: Communication collection	HCCL/NPU owner; scheduled with the same or adjacent NPU host window	Collect or refresh HCCL CSVs for the target CANN version	`validate_comm_alignment.py` passes under the agreed tolerance
Phase 5: Promotion	Data reviewer and runtime owner; after Phase 3-4 report review	Review CSV/YAML/report changes and promote the database directory	Runtime lookup tests pass and known limitations are recorded in the refresh report

3.2 Test Plan

Run no-NPU checks before any hardware collection:

python -m pytest tests/tools/test_op_replay.py \
  tests/tools/test_op_replay_common.py \
  tests/tools/test_start_microbench.py \
  tests/tools/test_generate_shape_grid.py \
  tests/tools/test_fia_parser_backfill.py \
  tests/tools/test_generate_comm_microbench.py \
  tests/tools/test_validate_comm_alignment.py

python -m pytest tests/perf_database

Run syntax and help checks for changed replay scripts:

python -m py_compile tools/perf_data_collection/op_replay/<KernelType>_run.py
python tools/perf_data_collection/op_replay/<KernelType>_run.py --help

Run hardware-dependent checks only on a prepared NPU host:

python tools/perf_data_collection/start_microbench.py \
  --database-path tensor_cast/performance_model/profiling_database/data/{device}/vllm_ascend/{version_dir} \
  --op MatMulV2 RmsNorm \
  --repeat-count 1 \
  --update-mode missing-only

bash tools/perf_data_collection/comm_bench/run_comm_bench.sh \
  tensor_cast/performance_model/profiling_database/data/{device}/hccl/{cann_version}

3.3 Follow-up Work

Add an in-repo replay-script generator that consumes torch_npu_reference.{KernelType}.microbench_api and operator-specific metadata.
Add stricter CSV schema validation and duplicate-signature checks before writeback.
Expand NPU-marked integration tests to cover representative custom operators.
Improve FIA runtime metadata capture to reduce replay rows that depend on heuristic inference.
Add a single collection command that orchestrates parse, enrich, shape generation, replay, communication validation, and report packaging while preserving this file contract.

3.4 Completion Criteria

tools/perf_data_collection/ entry points, database directories, CSV schemas, and the op_mapping.yaml contract match this RFC.
At least one target software-stack directory can be generated through parse, optional enrich, shape generation, replay writeback, and report generation.
The HCCL directory can be generated by communication benchmark tooling and referenced through communication_data_ref.
No-NPU tests cover parser behavior, shape grid, FIA backfill, start_microbench writeback logic, communication CSV validation, and data-source query behavior.
On an NPU host, selected replay and communication benchmark commands can run and produce status and gap reports for human review.