mf_adapter Data Transfer Tool Example

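Introduction

This tool uses the mf_adapter library to transfer data efficiently between two nodes (a Prefill node and a Decode node) in distributed computing scenarios based on Ascend NPUs. It registers memory and uses direct memory access to move data across nodes quickly, avoiding the performance overhead of conventional I/O operations.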

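Features

The tool has two working roles:

Prefill role: generates data and transfers it to the target node
Decode role: receives and processes the data sent by the Prefill node

The two roles coordinate through a configuration store service (store_url) and communicate over RPC ports, supporting efficient transfer of large tensor data (torch.Tensor).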

Usage

Basic command format

python test_transfer_engine.py [options]

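Parameters

Parameter	Required	Description
--role	Yes	Working role; one of `Prefill` or `Decode`
--src-unique-id	No	Session ID of the current node, in the format `ip:port`
--store-url	Yes	Address of the configuration store service, in the format `tcp://ip:port` or `tcp://[ip]:port`
--npu-id	No	NPU device ID; default: 0
--dst-unique-id	No	Session ID of the target node (needed only for the Prefill role), in the format `ip:port`
--log-level	No	Log level: 0 (debug), 1 (info), 2 (warn), 3 (error); default: 0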
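The parameter table can be read as a command-line specification. The following is a minimal sketch of an argparse parser that mirrors it; it is not taken from test_transfer_engine.py, and the helper names `build_parser` and `parse_args`, as well as the rule that rejects a Prefill run without `--dst-unique-id`, are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the parameter table (a sketch, not the script's own parser)."""
    parser = argparse.ArgumentParser(description="mf_adapter transfer example (sketch)")
    parser.add_argument("--role", required=True, choices=["Prefill", "Decode"])
    parser.add_argument("--src-unique-id", default=None, help="ip:port of this node")
    parser.add_argument("--store-url", required=True, help="tcp://ip:port or tcp://[ip]:port")
    parser.add_argument("--npu-id", type=int, default=0)
    parser.add_argument("--dst-unique-id", default=None, help="ip:port of the target node")
    parser.add_argument("--log-level", type=int, choices=[0, 1, 2, 3], default=0)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv and apply the role-specific check."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: the table says --dst-unique-id is needed only for Prefill,
    # so this sketch rejects a Prefill run that omits it.
    if args.role == "Prefill" and not args.dst_unique_id:
        parser.error("--dst-unique-id is required when --role is Prefill")
    return args


# Usage: parse the Decode command line from the run steps.
args = parse_args([
    "--role", "Decode",
    "--src-unique-id", "127.0.0.1:50051",
    "--store-url", "tcp://127.0.0.1:8000",
])
print(args.role, args.npu_id, args.log_level)
```

Omitted options fall back to the defaults listed in the table (NPU 0, log level 0).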

Running the example

1. Start the Decode node

python test_transfer_engine.py \
    --role Decode \
    --src-unique-id "127.0.0.1:50051" \
    --store-url tcp://127.0.0.1:8000 \
    --npu-id 0 \
    --log-level 1

2. Start the Prefill node

python test_transfer_engine.py \
    --role Prefill \
    --src-unique-id "127.0.0.1:50052" \
    --store-url tcp://127.0.0.1:8000 \
    --npu-id 1 \
    --dst-unique-id "127.0.0.1:50051" \
    --log-level 1

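Workflow

Initialization phase:

Both nodes start and initialize the TransferEngine by calling engine.initialize()
The Decode node and the Prefill node synchronize configuration through store_url; both nodes must use the same store_url so that they can discover each other
Each side sets up the communication channel using its own src-unique-id and the target node's dst-unique-id; the Prefill node's dst-unique-id must exactly match the Decode node's src-unique-id (including the port number)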
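Because a mismatched ID is an easy mistake, it can help to validate the two session IDs before launching either node. The sketch below assumes the plain `ip:port` form without brackets; `parse_unique_id` and `ids_match` are hypothetical helpers and are not part of mf_adapter.

```python
import ipaddress


def parse_unique_id(unique_id: str) -> tuple[str, int]:
    """Split an `ip:port` session ID into (ip, port); raise ValueError if malformed."""
    host, sep, port_text = unique_id.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected ip:port, got {unique_id!r}")
    ip = ipaddress.ip_address(host)  # raises ValueError if host is not an IP address
    port = int(port_text)            # raises ValueError if the port is not numeric
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return str(ip), port


def ids_match(prefill_dst: str, decode_src: str) -> bool:
    """Return True when the Prefill target equals the Decode session ID, port included."""
    return parse_unique_id(prefill_dst) == parse_unique_id(decode_src)


# Usage with the IDs from the run steps.
print(ids_match("127.0.0.1:50051", "127.0.0.1:50051"))
```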

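Data transfer phase:

The Decode node creates a receive buffer (a float16 tensor of shape (10, 50, 40, 20, 60)) and registers its memory address
The Prefill node generates a random tensor, registers its memory, and transfers the data to the Decode node with the transfer_sync_write method
Data moves by direct memory access, which avoids data copies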
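For a sense of scale, the size of the receive buffer follows directly from its shape and element type. The short calculation below uses only the shape stated above and the fact that a float16 element occupies 2 bytes; the constant names are chosen for illustration.

```python
import math

SHAPE = (10, 50, 40, 20, 60)  # receive-buffer shape from the workflow description
FLOAT16_BYTES = 2             # a float16 element occupies 2 bytes

num_elements = math.prod(SHAPE)
num_bytes = num_elements * FLOAT16_BYTES
print(f"{num_elements} elements, {num_bytes} bytes, {num_bytes / 2**20:.2f} MiB")
# prints: 24000000 elements, 48000000 bytes, 45.78 MiB
```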

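In-transit quantization test

Replace the test script test_transfer_engine.py with test_quant_trans.py and keep the parameters unchanged. You need to check manually that the sums of ret_quant and ret_scale printed in the logs agree within an acceptable error.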
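The acceptable error is not defined here, so any automated check has to pick its own threshold. As a rough sketch, assuming the check compares a sum printed on one node with the corresponding sum printed on the other, the values copied from the logs could be compared with a relative tolerance. The helper `sums_agree`, the default tolerance, and the sample numbers are all hypothetical.

```python
import math


def sums_agree(expected: float, observed: float, rel_tol: float = 1e-3) -> bool:
    """Return True when two logged sums agree within a relative tolerance.

    rel_tol is a placeholder; choose it to match the precision expected
    from the quantization in use.
    """
    return math.isclose(expected, observed, rel_tol=rel_tol, abs_tol=1e-6)


# Usage with made-up numbers standing in for values copied from the logs.
print(sums_agree(1000.0, 1000.5))
print(sums_agree(1000.0, 1010.0))
```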

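Troubleshooting

Initialization failure:

Check that store_url is reachable (telnet ip port can verify port connectivity). Use a TCP address that every node can reach, preferably with a fixed port (for example tcp://192.168.1.1:23456)
Confirm that src-unique-id is correctly formatted and not already in use
Verify that the NPU device is working properly (check with the npu-smi info command)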
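Where telnet is not installed, the same connectivity check can be done from Python with the standard library. This is a minimal sketch that only tests whether a TCP connection to the store address can be opened; it says nothing about whether the store service itself is healthy. `parse_store_url` and `store_reachable` are hypothetical helper names.

```python
import socket
from urllib.parse import urlsplit


def parse_store_url(store_url: str) -> tuple[str, int]:
    """Extract (host, port) from tcp://ip:port or tcp://[ip]:port."""
    parts = urlsplit(store_url)
    if parts.scheme != "tcp" or parts.hostname is None or parts.port is None:
        raise ValueError(f"expected tcp://ip:port or tcp://[ip]:port, got {store_url!r}")
    return parts.hostname, parts.port


def store_reachable(store_url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the store address can be opened."""
    host, port = parse_store_url(store_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Usage: the address from the run steps.
print(store_reachable("tcp://127.0.0.1:8000", timeout=1.0))
```

Run it on every node that will join the transfer, since the address must be reachable from all of them.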

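Data transfer failure:

Check that the Prefill node's dst-unique-id exactly matches the Decode node's src-unique-id
Confirm that memory registration succeeded on both nodes (the log contains register success)
Review debug-level logs (--log-level 0) for detailed error information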