GGitHub[feat] Support metastore mode for Yuanrong backend init (#74 )

TransferQueue Throughput Test

This script runs throughput tests for TransferQueue with different backends.

Prerequisites

Start Ray cluster with node resources:

# On head node
ray start --head --resources='{"node:192.168.0.1":1}'
# On worker node
ray start --address=192.168.0.1:6379 --resources='{"node:192.168.0.2":1}'

Start the backend service (Yuanrong, MooncakeStore, etc.) if testing non-SimpleStorage backends.

Usage

python perftest.py \
  --backend_config=perftest_config.yaml \
  --backend=SimpleStorage \
  --device=cpu \
  --global_batch_size=1024 \
  --field_num=9 \
  --seq_len=8192 \
  --head_node_ip=192.168.0.1 \
  --worker_node_ip=192.168.0.2

Arguments

Argument	Description	Default	Required
`--backend_config`	Path to backend config YAML file	-	Yes
`--backend`	Override `storage_backend` in config (`SimpleStorage`, `Yuanrong`, `MooncakeStore`)	None	No
`--device`	Device: `cpu`, `npu`, `gpu`	`cpu`	No
`--global_batch_size`	Global batch size	1024	No
`--field_num`	Number of fields in the TensorDict	10	No
`--seq_len`	Sequence length	8192	No
`--num_test_iterations`	Number of test iterations	4	No
`--head_node_ip`	Head node IP address	-	Yes
`--worker_node_ip`	Worker node IP address (required for Yuanrong)	None	No
`--output_csv`	Path to output CSV file	None	No
`--use_complex_case`	Use complex test case with nested tensors and NonTensorStack fields	False	No

Backend Configuration

The script reads the backend configuration directly from the provided --backend_config YAML file. The backend type is determined by backend.storage_backend in the config file. When --backend is specified, it overrides the value in the config.

SimpleStorage Configuration

backend:
  storage_backend: SimpleStorage
  SimpleStorage:
    total_storage_size: 100000
    num_data_storage_units: 16

Yuanrong Configuration

backend:
  storage_backend: Yuanrong
  Yuanrong:
    auto_init: True
    worker_port: 31501
    metastore_port: 2379
    enable_yr_npu_transport: true
    worker_args: "--shared_memory_size_mb 65536 --remote_h2d_device_ids 0 --enable_huge_tlb true"

For Yuanrong backend, writer runs on the head node and reader runs on the worker node. --worker_node_ip is required.

MooncakeStore Configuration

backend:
  storage_backend: MooncakeStore
  MooncakeStore:
    auto_init: true
    metadata_server: localhost:50050
    master_server_address: localhost:50051
    local_hostname: ""
    protocol: rdma
    global_segment_size: 86294967296
    local_buffer_size: 86294967296
    device_name: ""

Test Scenarios

Simple Test Case (Default)

When --use_complex_case is not specified (default), the test creates a TensorDict with only regular tensors:

Regular tensors: Shape (batch_size, seq_length), float32.

Each regular tensor field size = batch_size × seq_length × 4 bytes.

Complex Test Case

When --use_complex_case is specified, the test creates a TensorDict with three types of fields to simulate real training batches:

Regular tensors: Shape (batch_size, seq_length), float32.
Nested tensors (non-NPU devices): Variable-length ragged sequences with lengths forming an arithmetic progression from 1 to seq_length. Average length ≈ seq_length / 2, so each nested field is roughly half the size of a regular field.
NonTensorStack strings: Each string is seq_length × 4 bytes, matching the memory footprint of one tensor element.

Fields are distributed evenly across the three types (rounded up). For NPU devices, nested tensors fall back to regular tensors of shape (batch_size, seq_length // 2).

Test Flow

Each iteration performs a PUT → LIST → GET → DELETE cycle via TransferQueue's KV API:

PUT (kv_batch_put): Writer sends the TensorDict to storage.
LIST (kv_list): Reader queries available keys in the partition.
GET (kv_batch_get): Reader fetches data for those keys.
DELETE (kv_clear): Writer removes the written data.

The test runs --num_test_iterations iterations. Data creation only happens in the first iteration; subsequent iterations reuse the same TensorDict to isolate transfer overhead.

Running Full Test Suite

The run_perf_test.sh script automates the full test suite across all backends and data sizes, then generates a comparison chart:

cd scripts/performance_test
./run_perf_test.sh

Configuration

Configure via environment variables:

Variable	Description	Default
`HEAD_NODE_IP`	Head node IP address	`127.0.0.1`
`WORKER_NODE_IP`	Worker node IP address	`127.0.0.1`
`DEVICE`	Device type (`cpu`, `npu`, `gpu`)	`cpu`
`NUM_TEST_ITERATIONS`	Number of iterations per test	`4`
`USE_COMPLEX_CASE`	Run with complex test case (nested + nontensor fields)	`false`

Example:

# Simple case (default, regular tensors only)
./run_perf_test.sh

# Complex case (nested tensors + nontensor strings)
USE_COMPLEX_CASE=true ./run_perf_test.sh

# With specific node IPs & use NPU
HEAD_NODE_IP=192.168.0.1 WORKER_NODE_IP=192.168.0.2 DEVICE=npu ./run_perf_test.sh

Test Matrix

Backends: SimpleStorage, Yuanrong, MooncakeStore, Ray (baseline)
Data sizes: Small (batch=1024, fields=9, seq=8192), Medium (batch=4096, fields=15, seq=32768), Large (batch=8192, fields=18, seq=100000)

Output

CSV results: results/{backend}_{size}.csv (e.g., results/simplestorage_small.csv, results/ray_baseline_medium.csv)
Performance chart: results/performance_comparison.pdf

Ray Baseline

ray_perftest_baseline.py measures raw Ray inter-node transfer throughput without TransferQueue, serving as a baseline. It passes a TensorDict directly to a remote Ray actor (via ray.get), using the same test data format. It is automatically included in run_perf_test.sh.

draw_figure.py

After running the tests, draw_figure.py reads all CSV files from results/ and generates a grouped bar chart comparing total throughput (Gbps) across backends and data sizes.

Examples

SimpleStorage backend (simple case)

python perftest.py --backend_config=perftest_config.yaml --backend=SimpleStorage \
  --head_node_ip=192.168.0.1

SimpleStorage backend (complex case)

python perftest.py --backend_config=perftest_config.yaml --backend=SimpleStorage \
  --head_node_ip=192.168.0.1 --use_complex_case

Yuanrong backend (inter-node)

python perftest.py --backend_config=perftest_config.yaml --backend=Yuanrong \
  --head_node_ip=192.168.0.1 --worker_node_ip=192.168.0.2

MooncakeStore backend

python perftest.py --backend_config=perftest_config.yaml --backend=MooncakeStore \
  --head_node_ip=192.168.0.1

NPU device test (Yuanrong)

python perftest.py --backend_config=perftest_config.yaml --backend=Yuanrong --device=npu \
  --head_node_ip=192.168.0.1 --worker_node_ip=192.168.0.2

Output to CSV

python perftest.py --backend_config=perftest_config.yaml --backend=SimpleStorage \
  --head_node_ip=192.168.0.1 --output_csv=results.csv

Output Format

The test prints:

Total data size
PUT time and throughput
GET time and throughput
Total round-trip throughput

Throughput is shown in both Gb/s (gigabits per second) and GB/s (gigabytes per second).

CSV Columns

Column	Description
`backend`	Backend name
`device`	Device type
`total_data_size_gb`	Data size in GB
`put_time`	PUT duration (seconds)
`get_time`	GET duration (seconds)
`put_gbit_per_sec`	PUT throughput (Gbps)
`get_gbit_per_sec`	GET throughput (Gbps)
`total_gbit_per_sec`	Round-trip throughput (Gbps)