
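Data Transfer Performance Tool Example

Introduction

This tool measures the performance of the underlying transfer layer. It can be configured to use either SDMA or RDMA.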

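Usage

Build and Install

  1. Standalone build

With the run package installed and the environment variables set, run the following in the current directory: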

  mkdir build
  cmake . -B build
  make -C build
  2. Build and package together with the run package
bash script/build_and_pack_run.sh --build_mode RELEASE --build_python ON --xpu_type NPU --build_test ON

bash output/memfabric_hybrid-1.0.0_linux_aarch64.run # replace with the run file actually produced by the build
source /usr/local/memfabric_hybrid/set_env.sh
cd /usr/local/memfabric_hybrid/latest/aarch64-linux/test

After the run package is installed, transfer_perf is located in /usr/local/memfabric_hybrid/latest/aarch64-linux/test.

Basic Command Format

# transfer_perf {rankSize} {rankId} {deviceId} {useSdma} tcp://{ip}:{port} {memType}
# or, with an IPv6 address:
# transfer_perf {rankSize} {rankId} {deviceId} {useSdma} tcp://{[ipv6]}:{port} {memType}

# deviceId=2
./transfer_perf 2 0 2 1 tcp://127.0.0.1:12050 0
# IPv6 equivalent: ./transfer_perf 2 0 2 1 tcp://[::1]:12050 0

# deviceId=3
./transfer_perf 2 1 3 1 tcp://127.0.0.1:12050 0
# IPv6 equivalent: ./transfer_perf 2 1 3 1 tcp://[::1]:12050 0
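The two example commands differ only in rankId and deviceId, so they can be generated from a list of device IDs. The following is a minimal sketch, not the repository's run.sh: `launch_ranks` and the `DRY_RUN` variable are hypothetical names, and starting every rank in the background on the same host is an assumption. With `DRY_RUN=1` (the default in this sketch) the commands are only printed, so the output can be reviewed before anything runs.

```shell
#!/usr/bin/env bash
# Hypothetical launcher sketch; the run.sh shipped with the tool may differ.
# Usage: launch_ranks <useSdma> <store_url> <memType> <deviceId>...
# One rank is started per device ID; rank IDs are assigned 0, 1, ... in order.
launch_ranks() {
    local use_sdma=$1 store_url=$2 mem_type=$3
    shift 3
    local rank_size=$# rank_id=0 device_id
    for device_id in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            # Dry run: print the command instead of executing it.
            echo "./transfer_perf $rank_size $rank_id $device_id $use_sdma $store_url $mem_type"
        else
            # Assumption: all ranks run on this host and may start concurrently.
            ./transfer_perf "$rank_size" "$rank_id" "$device_id" "$use_sdma" "$store_url" "$mem_type" &
        fi
        rank_id=$((rank_id + 1))
    done
    # Wait for any background ranks to finish (returns immediately in a dry run).
    wait
}

# Reproduces the two SDMA example commands above (devices 2 and 3).
launch_ranks 1 tcp://127.0.0.1:12050 0 2 3
```

Setting `DRY_RUN=0` before the call would execute the commands, which requires transfer_perf to be present in the current directory.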

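Parameter Description

| Parameter | Required | Description |
| --- | --- | --- |
| rankSize | | Total number of ranks |
| rankId | | rankId of the current node |
| deviceId | | deviceId of the current node |
| useSdma | | 1 to use SDMA, 0 to use RDMA |
| tcp://{ip}:{port} | | Address of the config store service, in the form tcp://ip:port or tcp://[ipv6]:port. This is the IP and port that the configStore server listens on. For details on the configStore configuration storage system, see config_store_cluster_ha |
| memType | | Memory medium type: 0 = HBM, 1 = DRAM, 2 = HBM + DRAM |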
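The value ranges listed above can be checked before the tool is launched. The sketch below is a hypothetical pre-flight helper, not part of transfer_perf: the names `is_uint` and `check_args` are invented for illustration, and the rule that rankId must be smaller than rankSize is an assumption inferred from the examples (rank IDs 0 and 1 with rankSize 2).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the six positional arguments.

# Succeeds only if $1 is a non-empty string of decimal digits.
is_uint() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Usage: check_args <rankSize> <rankId> <deviceId> <useSdma> <tcp://ip:port> <memType>
check_args() {
    if [ "$#" -ne 6 ]; then
        echo "expected 6 arguments, got $#" >&2
        return 1
    fi
    local rank_size=$1 rank_id=$2 device_id=$3 use_sdma=$4 store_url=$5 mem_type=$6
    local v
    for v in "$rank_size" "$rank_id" "$device_id"; do
        is_uint "$v" || { echo "not a non-negative integer: $v" >&2; return 1; }
    done
    # Assumption: rank IDs are zero-based, so rankId < rankSize.
    [ "$rank_id" -lt "$rank_size" ] || { echo "rankId must be smaller than rankSize" >&2; return 1; }
    case $use_sdma in
        0|1) ;;
        *) echo "useSdma must be 0 (RDMA) or 1 (SDMA)" >&2; return 1 ;;
    esac
    case $store_url in
        tcp://*:*) ;;
        *) echo "address must look like tcp://ip:port" >&2; return 1 ;;
    esac
    case $mem_type in
        0|1|2) ;;
        *) echo "memType must be 0, 1 or 2" >&2; return 1 ;;
    esac
}

check_args 2 0 2 1 tcp://127.0.0.1:12050 0 && echo "arguments look valid"
```

The address check is deliberately loose: it only confirms the tcp:// prefix and the presence of a port separator, so both the IPv4 and the bracketed IPv6 forms pass.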

Running

After adjusting the parameters in run.sh, run it directly to collect the performance results:

source /usr/local/memfabric_hybrid/set_env.sh

bash run.sh

Experimental Results (SDMA transfer validation within an A3 pod)

Device: 0->1 transfer performance

Test completed: latency 12.5us, block size 32KB, total size 1024KB , throughput 9.70 GB/s
Test completed: latency 12.1us, block size 64KB, total size 2048KB , throughput 19.33 GB/s
Test completed: latency 10.2us, block size 128KB, total size 4096KB , throughput 39.98 GB/s
Test completed: latency 11.3us, block size 256KB, total size 8192KB , throughput 76.26 GB/s
Test completed: latency 13.6us, block size 512KB, total size 16384KB , throughput 151.83 GB/s
Test completed: latency 15.9us, block size 1024KB, total size 32768KB , throughput 179.72 GB/s
Test completed: latency 23.7us, block size 2048KB, total size 65536KB , throughput 179.87 GB/s
Test completed: latency 35.5us, block size 4096KB, total size 131072KB , throughput 184.24 GB/s
Test completed: latency 57.8us, block size 8192KB, total size 262144KB , throughput 183.11 GB/s
Test completed: latency 102.5us, block size 16384KB, total size 524288KB , throughput 186.61 GB/s

Device: 0->2 transfer performance

Test completed: latency 15.5us, block size 32KB, total size 1024KB , throughput 9.79 GB/s
Test completed: latency 15.1us, block size 64KB, total size 2048KB , throughput 18.96 GB/s
Test completed: latency 15.6us, block size 128KB, total size 4096KB , throughput 37.72 GB/s
Test completed: latency 15.9us, block size 256KB, total size 8192KB , throughput 71.15 GB/s
Test completed: latency 16.3us, block size 512KB, total size 16384KB , throughput 93.10 GB/s
Test completed: latency 22.5us, block size 1024KB, total size 32768KB , throughput 110.23 GB/s
Test completed: latency 29.5us, block size 2048KB, total size 65536KB , throughput 122.04 GB/s
Test completed: latency 45.8us, block size 4096KB, total size 131072KB , throughput 118.61 GB/s
Test completed: latency 77.2us, block size 8192KB, total size 262144KB , throughput 130.72 GB/s
Test completed: latency 139us, block size 16384KB, total size 524288KB , throughput 133.87 GB/s
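To compare runs such as the two above, the result lines can be reduced to a single peak figure. The following is a minimal sketch that assumes the log keeps the `Test completed: ... block size NKB ... throughput X GB/s` layout shown in this section; `peak_throughput` is a hypothetical helper name, not something the tool provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads transfer_perf result lines on stdin and reports
# the highest throughput together with the block size that produced it.
peak_throughput() {
    awk '
        /^Test completed:/ {
            # "block size " is 11 characters long and "KB" is 2.
            if (match($0, /block size [0-9]+KB/)) {
                block = substr($0, RSTART + 11, RLENGTH - 13)
            }
            # "throughput " is 11 characters long and " GB/s" is 5.
            if (match($0, /throughput [0-9.]+ GB\/s/)) {
                tput = substr($0, RSTART + 11, RLENGTH - 16) + 0
            }
            if (tput > best) { best = tput; best_block = block }
        }
        END {
            if (best_block == "") { print "no results found"; exit }
            printf "peak %.2f GB/s at block size %sKB\n", best, best_block
        }
    '
}

# Two lines taken from the Device 0->1 results above.
peak_throughput <<'EOF'
Test completed: latency 12.5us, block size 32KB, total size 1024KB , throughput 9.70 GB/s
Test completed: latency 35.5us, block size 4096KB, total size 131072KB , throughput 184.24 GB/s
EOF
```

In practice the output of run.sh could be piped into the function, for example `bash run.sh | peak_throughput`, provided run.sh prints the result lines to standard output.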