TransferQueue: a high-performance data transfer management project built on distributed storage

An asynchronous, high-performance streaming data engine

File	Last commit	Last updated
.github	[fix] Make the controller survive bad ZMQ messages and repair the pickle fallback (#143)

## Motivation

A long-running RL job hit an occasional failure that took the whole run down:

`Exception in thread TransferQueueControllerProcessRequestThread: Traceback (most recent call last): ... File "transfer_queue/controller.py", line 1771, in _process_request request_msg = ZMQMessage.deserialize(serialized_msg) File "transfer_queue/utils/zmq_utils.py", line 182, in deserialize result = decode(frames) File "transfer_queue/utils/serial_utils.py", line 262, in decode result = self.decoder.decode(bufs[0]) msgspec.DecodeError: Input data was truncated`

`_process_request` had no exception handling, so this one message killed `TransferQueueControllerProcessRequestThread` permanently. The controller stayed alive as an actor but stopped answering every request from that point on, and the job hung. While investigating I found more defects on the same path.

## What was wrong

1. One bad message killed the request loop. `_process_request` ran `recv_multipart` → `deserialize` → dispatch → `send_multipart` with no guard anywhere. Note that `_wait_connection` in the same file already logs and continues; the request loop did not.
2. Dropped requests hang the caller forever. Clients issue requests through `with_controller_socket` with no `RCVTIMEO` and then block on `await socket.recv_multipart()`. Any controller path that fails to reply (an undecodable request, an unhandled request type, a handler that raises) leaves that caller waiting for the rest of the run.
3. `decode()` could never detect the pickle fallback marker. It located the marker with `frames[0] == _PICKLE_FALLBACK_SENTINEL`. Every receiver calls `recv_multipart(copy=False)` and therefore holds `zmq.Frame` objects, and `zmq.Frame` defines no `__eq__` against `bytes`, so the comparison was always False and the marker frame was handed to msgpack.
4. `encode()` almost never reached that fallback anyway. It caught only `TypeError`/`ValueError`, but msgspec reports unrepresentable values with `OverflowError` (an `ArithmeticError`), `RecursionError` (a `RuntimeError`) and its own `MsgspecError`, none of which derive from `ValueError`. Those escaped to the caller instead of degrading to pickle, even though pickle handles all of them (arbitrary-precision ints, self-referential containers).
5. Unhandled request types replied with a stale response. Every branch in the dispatch chain is an `if`/`elif` with no `else`, so an unknown `request_type` fell through to `send_multipart([identity, response_msg.serialize()])` still holding the *previous* iteration's `response_msg`, sending an unrelated requester someone else's answer.

## What this changes

- `_process_request` is split into a thin supervisor plus `_process_request_loop`; anything that escapes the per-request handling restarts the loop, so the thread survives for the controller's lifetime. The ROUTER socket stays bound across a restart.
- The dispatch chain moves into `_handle_request()`, which returns the response or `None`. Every request now gets exactly one reply:
  - the handler's normal response;
  - a new generic `ZMQRequestType.REQUEST_ERROR` (built by `_make_error_response()`) when the handler raises. The reply body carries `{request_type} failed: {error}`, so the caller raises immediately instead of hanging;
  - `REQUEST_ERROR` ("no handler for request_type ...") when no branch matches, replacing the stale-response replay;
  - `REQUEST_ERROR` for undecodable requests too. The ROUTER identity frame is prepended by the transport and survives payload corruption, so the reply still reaches the requester.

  No client change is needed: every client call site already treats an unexpected response type as `RuntimeError(body["message"])`.
- `ZMQMessage.deserialize` self-checks the frame layout and raises `ZMQMessageDecodeError` carrying frame count and per-frame sizes. It lives in `deserialize`, so controller, storage, client and manager all get the diagnostics.
- `decode()` compares buffer contents via `_is_pickle_fallback()`. Size is tested first, so the msgpack path stays copy-free.
- `encode()` catches the errors msgspec actually raises, via a named `_ENCODE_FALLBACK_ERRORS` tuple documenting why none of them are `ValueError`. Falling back now logs at `warning` level: it is a performance degradation path and should be visible at default log levels.
- mypy now runs over the whole package in pre-commit (`pass_filenames: false`); with `follow_imports = "skip"`, passing only changed files produced false errors.

## Scope

This does not fix the root cause of the truncation. The evidence says the ZMQ multipart frame boundaries shift, so frame 0 stops being the msgpack header. Decoding an empty buffer produces exactly `Input data was truncated`, and empty frames are routine here since empty tensors serialize to zero-length frames. What this PR does is stop that from taking the job down, and emit the frame layout needed to confirm it.

The frame sizes make the diagnosis immediate:

- `healthy message: num_frames=3, frame_sizes=[184, 512, 512]`
- `empty leading frame: leading frame is empty; num_frames=4, frame_sizes=[0, 184, 512, 512]`
- `boundary on a tensor frame: trailing characters (byte 1); num_frames=2, frame_sizes=[512, 512]`
- `truncated header: Input data was truncated; num_frames=3, frame_sizes=[12, 512, 512]`

A healthy message is one small header frame followed by large buffers. A leading `0`, a missing small header, or an implausibly small header all point at shifted boundaries.

One caveat to note for rollout: the pickle fallback only works end to end when both peers run this version. An old receiver (`recv_multipart(copy=False)` + `==` marker check) cannot recognise fallback frames from a new sender, and new senders produce them in more situations than before. Upgrade controller, storage units and clients together.

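The "exactly one reply per request" rule described under "What this changes" can be sketched as follows. This is a minimal illustration, not the controller's code: the `HANDLERS` table, the dictionary-shaped messages, and the function names `handle_request`, `make_error_response` and `reply_for` are hypothetical stand-ins for `_handle_request()`, `_make_error_response()` and the real `ZMQMessage` types, and the actual controller dispatches with an `if`/`elif` chain.

```python
from typing import Callable, Dict, Optional

REQUEST_ERROR = "REQUEST_ERROR"  # name taken from the commit message

# Hypothetical handler table standing in for the controller's dispatch chain.
HANDLERS: Dict[str, Callable[[dict], dict]] = {
    "GET_META": lambda body: {"type": "GET_META_RESPONSE", "meta": body["keys"]},
}


def make_error_response(message: str) -> dict:
    """Build the generic error reply that the caller turns into an exception."""
    return {"type": REQUEST_ERROR, "message": message}


def handle_request(request: dict) -> Optional[dict]:
    """Return the handler's response, or None when no handler matches."""
    handler = HANDLERS.get(request.get("type"))
    if handler is None:
        return None
    return handler(request.get("body", {}))


def reply_for(request: dict) -> dict:
    """Produce exactly one reply: never a stale response, never silence."""
    request_type = request.get("type")
    try:
        response = handle_request(request)
    except Exception as exc:  # a raising handler must still answer the caller
        return make_error_response(f"{request_type} failed: {exc!r}")
    if response is None:
        return make_error_response(f"no handler for request_type {request_type}")
    return response
```

Because `reply_for` builds its reply from local values only, there is no variable left over from an earlier iteration that an unmatched request could send back by accident.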
## Tests

`TestPickleFallback` in `tests/test_serial_utils_on_cpu.py` (13 cases): oversized ints and self-referential containers degrade instead of raising, round-trips through `zmq.Frame`, marker detection across `bytes`/`bytearray`/`memoryview`/`zmq.Frame`, four near-miss negatives, and an end-to-end round trip over a real ROUTER with `recv_multipart(copy=False)`, the transport that hid the bug.

`TestTransferQueueControllerBadRequests` in `tests/test_controller.py` (3 cases):

- an empty leading frame (shifted multipart boundary) gets a `REQUEST_ERROR` reply and the controller keeps serving;
- an unhandled `PUT_DATA` gets a `REQUEST_ERROR` reply instead of replaying the previous iteration's stale response;
- a `GET_META` body missing its keys makes the handler raise `KeyError`, and the requester gets a `REQUEST_ERROR` ("GET_META failed: KeyError ...") while the loop keeps answering subsequent requests.

All three were confirmed to fail against the pre-fix code and pass after.

Results: 115 passed across the two serialization suites, 25 passed across `tests/test_controller.py` (including the 3 new cases), and 76 passed across `tests/e2e/`. `pre-commit run --all-files` is green (ruff, ruff-format, mypy).

## Unrelated CI fix bundled here

The MooncakeStore e2e job started failing on `libcudart.so.12: cannot open shared object file`: upstream `mooncake-transfer-engine-non-cuda` 0.3.12 ships a `mooncake_master` binary linked against the CUDA 12 runtime (verified by diffing the DT_NEEDED entries of the 0.3.11.post1 and 0.3.12.post1 wheels; 0.3.11.post1 has no CUDA dependency). The two workflows that install it now pin `mooncake-transfer-engine-non-cuda<0.3.12` until upstream fixes the wheel.

---------

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>	1 day ago
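The marker detection described in the commit message above (compare buffer contents, size first) might look roughly like the sketch below. It is an assumption-laden illustration rather than the project's `_is_pickle_fallback()`: the marker value `PICKLE_FALLBACK_SENTINEL` is made up, and `FakeFrame` is a hypothetical stand-in that mimics only the `.buffer` memoryview attribute of `zmq.Frame`, so the example runs without pyzmq.

```python
from typing import Any

# Hypothetical marker value; the real _PICKLE_FALLBACK_SENTINEL is not shown in the source.
PICKLE_FALLBACK_SENTINEL = b"__PICKLE_FALLBACK__"


class FakeFrame:
    """Stand-in for zmq.Frame: exposes the payload as a memoryview, defines no __eq__."""

    def __init__(self, data: bytes) -> None:
        self.buffer = memoryview(data)


def is_pickle_fallback(frame: Any) -> bool:
    """Return True when `frame` holds exactly the marker bytes."""
    # Frame-like objects expose `.buffer`; bytes, bytearray and memoryview are used directly.
    view = memoryview(getattr(frame, "buffer", frame))
    # Size is tested first, so ordinary msgpack headers are rejected without copying.
    if view.nbytes != len(PICKLE_FALLBACK_SENTINEL):
        return False
    # Only a marker-sized buffer is ever copied for the content comparison.
    return view.tobytes() == PICKLE_FALLBACK_SENTINEL
```

A plain `frame == PICKLE_FALLBACK_SENTINEL` on a `FakeFrame` falls back to identity comparison and is always False, which is the same failure mode the commit message reports for `zmq.Frame`.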
docs	[feat] support GDR in mooncake backend (#131) Add GDR support for mooncake backend Signed-off-by: xupinjie <xupinjie321@outlook.com>	20 days ago
recipe	[feat] Support cross-job actor discovery via explicit namespace (#115)

When multiple Ray Jobs share the same Ray cluster, Named Actors are isolated by namespace. Without an explicit namespace, a TQ Controller created by one job is invisible to workers in another job.

This commit adds `namespace="transfer_queue"` to both:

- `ray.get_actor()` in `_init_from_existing()`
- `TransferQueueController.options()` in `init()`

This ensures that the TQ Controller is always registered and discovered in the fixed "transfer_queue" namespace, enabling cross-job TQ sharing (e.g., a teacher server job creates TQ, and a trainer job connects to it).

This change is backward-compatible: single-job usage is unaffected since the namespace is consistent between creation and discovery.

Signed-off-by: huniu20 <huniumail@gmail.com>	1 month ago
scripts	[feat] support GDR in mooncake backend (#131) Add GDR support for mooncake backend Signed-off-by: xupinjie <xupinjie321@outlook.com>	20 days ago
tests	[fix] Make the controller survive bad ZMQ messages and repair the pickle fallback (#143)	1 day ago
transfer_queue	[fix] Make the controller survive bad ZMQ messages and repair the pickle fallback (#143)	1 day ago
tutorial	[feat] Support cross-job actor discovery via explicit namespace (#115)	1 month ago
.gitignore	[Perf] Refactor performance test for different kv store backends (#52)

# Description

1. support different kv backends
2. support intra-node and inter-node client placement for yr
3. output to csv
4. remove the non-tensor part when create_complex_test_case
5. remove ray bandwidth test
6. add readme for perf test
7. test 3 times to mitigate variance (warmup)
8. use kv client to simplify usage

# Usage

`usage: perftest.py [-h] --backend_config BACKEND_CONFIG [--backend BACKEND] [--device {cpu,npu,gpu}] [--global_batch_size GLOBAL_BATCH_SIZE] [--field_num FIELD_NUM] [--seq_len SEQ_LEN] [--num_test_iterations NUM_TEST_ITERATIONS] --head_node_ip HEAD_NODE_IP [--worker_node_ip WORKER_NODE_IP] [--output_csv OUTPUT_CSV] [--use_complex_case]`

TransferQueue Throughput Test

options:

- `-h, --help`: show this help message and exit
- `--backend_config BACKEND_CONFIG`: Path to backend config YAML file
- `--backend BACKEND`: Override storage_backend in config (e.g. SimpleStorage, Yuanrong, MooncakeStore)
- `--device {cpu,npu,gpu}`: Device to use (default: cpu)
- `--global_batch_size GLOBAL_BATCH_SIZE`: Global batch size (default: 1024)
- `--field_num FIELD_NUM`: Number of fields (default: 10)
- `--seq_len SEQ_LEN`: Sequence length (default: 8192)
- `--num_test_iterations NUM_TEST_ITERATIONS`: Number of test iterations (default: 4)
- `--head_node_ip HEAD_NODE_IP`: Head node IP address
- `--worker_node_ip WORKER_NODE_IP`: Worker node IP address (required for Yuanrong)
- `--output_csv OUTPUT_CSV`: Path to output CSV file (optional)
- `--use_complex_case`: Use complex test case with nested tensors and nontensor fields (default: False, simple case)

closes #51

---------

Signed-off-by: tianyi-ge <tianyig@outlook.com>
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
Co-authored-by: 0oshowero0 <o0shower0o@outlook.com>	4 months ago
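For readers who want to reproduce the command-line interface shown in the help text above, a parser along these lines would accept the same flags. It is a sketch reconstructed from the help output only, not the contents of `perftest.py`: the function name `build_parser`, the `int` argument types, and the `None` defaults for the optional string flags are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags listed in the perftest.py help text."""
    parser = argparse.ArgumentParser(
        prog="perftest.py", description="TransferQueue Throughput Test"
    )
    parser.add_argument("--backend_config", required=True,
                        help="Path to backend config YAML file")
    parser.add_argument("--backend", default=None,
                        help="Override storage_backend in config")
    parser.add_argument("--device", choices=["cpu", "npu", "gpu"], default="cpu",
                        help="Device to use (default: cpu)")
    # Integer types below are assumed; the help text only shows the defaults.
    parser.add_argument("--global_batch_size", type=int, default=1024)
    parser.add_argument("--field_num", type=int, default=10)
    parser.add_argument("--seq_len", type=int, default=8192)
    parser.add_argument("--num_test_iterations", type=int, default=4)
    parser.add_argument("--head_node_ip", required=True,
                        help="Head node IP address")
    parser.add_argument("--worker_node_ip", default=None,
                        help="Worker node IP address (required for Yuanrong)")
    parser.add_argument("--output_csv", default=None,
                        help="Path to output CSV file (optional)")
    parser.add_argument("--use_complex_case", action="store_true",
                        help="Use complex test case (default: False, simple case)")
    return parser


if __name__ == "__main__":
    # Example invocation with placeholder values for the two required flags.
    args = build_parser().parse_args(
        ["--backend_config", "config.yaml", "--head_node_ip", "127.0.0.1"]
    )
    print(args.device, args.global_batch_size, args.seq_len)
```

Only `--backend_config` and `--head_node_ip` are required, matching the usage line; everything else falls back to the documented defaults.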
.pre-commit-config.yaml	[fix] Make the controller survive bad ZMQ messages and repair the pickle fallback (#143) ## Motivation A long-running RL job hit an occasional failure that took the whole run down: `Exception in thread TransferQueueControllerProcessRequestThread: Traceback (most recent call last): ... File "transfer_queue/controller.py", line 1771, in _process_request request_msg = ZMQMessage.deserialize(serialized_msg) File "transfer_queue/utils/zmq_utils.py", line 182, in deserialize result = decode(frames) File "transfer_queue/utils/serial_utils.py", line 262, in decode result = self.decoder.decode(bufs[0]) msgspec.DecodeError: Input data was truncated` `_process_request` had no exception handling, so this one message killed `TransferQueueControllerProcessRequestThread` permanently. The controller stayed alive as an actor but stopped answering every request from that point on, and the job hung. While investigating I found more defects on the same path. ## What was wrong 1. One bad message killed the request loop. `_process_request` ran `recv_multipart` → `deserialize` → dispatch → `send_multipart` with no guard anywhere. Note that `_wait_connection` in the same file already logs and continues; the request loop did not. 2. Dropped requests hang the caller forever. Clients issue requests through `with_controller_socket` with no `RCVTIMEO` and then block on `await socket.recv_multipart()`. Any controller path that fails to reply — an undecodable request, an unhandled request type, a handler that raises — leaves that caller waiting for the rest of the run. 3. `decode()` could never detect the pickle fallback marker. It located the marker with `frames[0] == _PICKLE_FALLBACK_SENTINEL`. Every receiver calls `recv_multipart(copy=False)` and therefore holds `zmq.Frame` objects, and `zmq.Frame` defines no `__eq__` against `bytes`, so the comparison was always False and the marker frame was handed to msgpack. 4. `encode()` almost never reached that fallback anyway. 
It caught only `TypeError`/`ValueError`, but msgspec reports unrepresentable values with `OverflowError` (an `ArithmeticError`), `RecursionError` (a `RuntimeError`) and its own `MsgspecError` — none of which derive from `ValueError`. Those escaped to the caller instead of degrading to pickle, even though pickle handles all of them (arbitrary-precision ints, self-referential containers). 5. Unhandled request types replied with a stale response. Every branch in the dispatch chain is an `if`/`elif` with no `else`, so an unknown `request_type` fell through to `send_multipart([identity, response_msg.serialize()])` still holding the previous* iteration's `response_msg`, sending an unrelated requester someone else's answer. ## What this changes - `_process_request` is split into a thin supervisor plus `_process_request_loop`; anything that escapes the per-request handling restarts the loop, so the thread survives for the controller's lifetime. The ROUTER socket stays bound across a restart. - The dispatch chain moves into `_handle_request()`, which returns the response or `None`. Every request now gets exactly one reply: - the handler's normal response; - a new generic `ZMQRequestType.REQUEST_ERROR` (built by `_make_error_response()`) when the handler raises — the reply body carries `{request_type} failed: {error}`, so the caller raises immediately instead of hanging; - `REQUEST_ERROR` ("no handler for request_type ...") when no branch matches, replacing the stale-response replay; - `REQUEST_ERROR` for undecodable requests too — the ROUTER identity frame is prepended by the transport and survives payload corruption, so the reply still reaches the requester. No client change is needed: every client call site already treats an unexpected response type as `RuntimeError(body["message"])`. - `ZMQMessage.deserialize` self-checks the frame layout and raises `ZMQMessageDecodeError` carrying frame count and per-frame sizes. 
It lives in `deserialize`, so controller, storage, client and manager all get the diagnostics. - `decode()` compares buffer contents via `_is_pickle_fallback()`. Size is tested first, so the msgpack path stays copy-free. - `encode()` catches the errors msgspec actually raises, via a named `_ENCODE_FALLBACK_ERRORS` tuple documenting why none of them are `ValueError`. Falling back now logs at `warning` level — it is a performance degradation path and should be visible at default log levels. - mypy now runs over the whole package in pre-commit (`pass_filenames: false`); with `follow_imports = "skip"`, passing only changed files produced false errors. ## Scope This does not fix the root cause of the truncation. The evidence says the ZMQ multipart frame boundaries shift, so frame 0 stops being the msgpack header — decoding an empty buffer produces exactly `Input data was truncated`, and empty frames are routine here since empty tensors serialize to zero-length frames. What this PR does is stop that from taking the job down, and emit the frame layout needed to confirm it. The frame sizes make the diagnosis immediate: `healthy message: num_frames=3, frame_sizes=[184, 512, 512] empty leading frame: leading frame is empty; num_frames=4, frame_sizes=[0, 184, 512, 512] boundary on a tensor frame: trailing characters (byte 1); num_frames=2, frame_sizes=[512, 512] truncated header: Input data was truncated; num_frames=3, frame_sizes=[12, 512, 512]` A healthy message is one small header frame followed by large buffers. A leading `0`, a missing small header, or an implausibly small header all point at shifted boundaries. One caveat to note for rollout: the pickle fallback only works end to end when both peers run this version — an old receiver (`recv_multipart(copy=False)` + `==` marker check) cannot recognise fallback frames from a new sender, and new senders produce them in more situations than before. Upgrade controller, storage units and clients together. 
## Tests `TestPickleFallback` in `tests/test_serial_utils_on_cpu.py` (13 cases): oversized ints and self-referential containers degrade instead of raising, round-trips through `zmq.Frame`, marker detection across `bytes`/`bytearray`/`memoryview`/`zmq.Frame`, four near-miss negatives, and an end-to-end round trip over a real ROUTER with `recv_multipart(copy=False)`, the transport that hid the bug. `TestTransferQueueControllerBadRequests` in `tests/test_controller.py` (3 cases): - an empty leading frame (shifted multipart boundary) gets a `REQUEST_ERROR` reply and the controller keeps serving; - an unhandled `PUT_DATA` gets a `REQUEST_ERROR` reply instead of replaying the previous iteration's stale response; - a `GET_META` body missing its keys makes the handler raise `KeyError`, and the requester gets a `REQUEST_ERROR` ("GET_META failed: KeyError ...") while the loop keeps answering subsequent requests. All three were confirmed to fail against the pre-fix code and pass after. Results: 115 passed across the two serialization suites, 25 passed across `tests/test_controller.py` (including the 3 new cases), and 76 passed across `tests/e2e/`. `pre-commit run --all-files` is green (ruff, ruff-format, mypy). ## Unrelated CI fix bundled here The MooncakeStore e2e job started failing on `libcudart.so.12: cannot open shared object file`: upstream `mooncake-transfer-engine-non-cuda` 0.3.12 ships a `mooncake_master` binary linked against the CUDA 12 runtime (verified by diffing the DT_NEEDED entries of the 0.3.11.post1 and 0.3.12.post1 wheels; 0.3.11.post1 has no CUDA dependency). The two workflows that install it now pin `mooncake-transfer-engine-non-cuda<0.3.12` until upstream fixes the wheel. --------- Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>	1 day ago
LICENSE	initialize TransferQueue repository Co-authored-by: 0oshowero0<o0shower0o@outlook.com> # message auto-generated for no-merge-commit merge: !1 merge main into main [chore] initialize TransferQueue repository Created-by: hanzhenyu8 Commit-by: 0oshowero0 Merged-by: ascend-robot Description: ## Background This PR marks the initial commit of the TransferQueue repository. We would like to express our sincere gratitude to the community for their invaluable contributions during the incubation phase of TransferQueue. - [@NINGBENZHE](https://github.com/NINGBENZHE): [[Feat]: add check_data_production_status and check_consumption_status and support Polling get metadata](https://github.com/TransferQueue/TransferQueue/pull/157). - [@zhaohaidao](https://github.com/zhaohaidao): [[Feat] Support Mooncake Store backend](https://github.com/TransferQueue/TransferQueue/pull/162). ### Historical Context To preserve the project's heritage, the early development history remains accessible at: https://github.com/TransferQueue/TransferQueue. Moving forward, we will maintain a mirror repository under the [Ascend organization](https://github.com/Ascend) on GitHub. You are welcome to submit contributions or propose new ideas on either platform. We look forward to continuing this journey with all of you! See merge request: Ascend/TransferQueue!1	6 months ago
README.md	[chore] Update README (#128) As title Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>	1 month ago
pyproject.toml	[chore] Bump version from 0.1.9.dev0 to 0.1.9 & update dependency (#139) As title Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>	20 days ago
requirements.txt	[chore] Relax numpy version constraints (#113) As title Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>	1 month ago


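TransferQueue: An Asynchronous Streaming Data Management Module for Efficient Post-Training

Paper | Zhihu | WeChat

🎉 Overview

TransferQueue is a high-performance data storage and transfer module with panoramic data visibility and streaming scheduling capabilities, optimized for efficient dataflow in post-training workflows.

TransferQueue offers fine-grained, sub-sample-level data management and load balancing. It serves as a data gateway that decouples explicit data dependencies between computational tasks, enabling a divide-and-conquer approach that significantly simplifies the design of the algorithm controller.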

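🔄 Updates

Jun 18, 2026: 🔥 TransferQueue has been adopted in ROLL. The integration introduces a RemoteBatch abstraction that is seamlessly compatible with the existing DataProto design.
Jun 9, 2026: 🔥 TransferQueue has been adopted in UniRL, a unified reinforcement learning framework for multimodal models developed by the Tencent Hunyuan team.
Apr 15, 2026: 🔥 TransferQueue has been adopted in Relax! By leveraging the StreamingDataLoader abstraction, it schedules training data across the cluster at micro-batch granularity, reducing the synchronization barriers of a single-controller setup.
Apr 10, 2026: 🔥 TransferQueue is now officially integrated into verl! On a cluster of 128 × H100 GPUs, we achieved a 49.1% end-to-end performance improvement for multimodal post-training. See our blog for more details.
Feb 8, 2026: 🔥 Initialization and usage are greatly simplified by the high-level APIs in PR#26 and PR#28. You can now use a Redis-style API to access most of the advanced features provided by TransferQueue!
Jan 28, 2026: We experimentally introduced the StreamingDataLoader interface for fully streamed production-consumption pipelines. See tutorials/06_streaming_dataloader.py for details.
Dec 30, 2025: The integration of TransferQueue with verl has been tested at scale on the DAPO algorithm (64 nodes, 1024 cards). It significantly optimizes host memory utilization and accelerates data transfers. Stay tuned for more details!
Dec 20, 2025: 🔥 The official tutorial is released! Feel free to check it out.
Nov 10, 2025: We decoupled the data retrieval logic from TransferQueueController in PR#101. You can now implement your own Sampler to customize how data is consumed.
Nov 5, 2025: We provide a KVStorageManager that simplifies integration with KV-based storage backends (PR#96). The first available KV-based backend is openYuanrong.
Nov 4, 2025: Data partitioning is available in PR#98. You can now define logical data partitions to manage train/validation/test datasets.
Oct 25, 2025: Storage backends are now pluggable (PR#66). You can try integrating your own storage backend with TransferQueue!
Oct 21, 2025: An early integration with verl is ready in verl/pull/3649. Follow-up PRs will optimize the single-controller architecture by fully decoupling data and control flows.
Jul 22, 2025: We published a series of Chinese blog posts on Zhihu: Zhihu 1, 2.
Jul 21, 2025: We started an RFC in the verl community: verl/RFC#2662.
Jul 2, 2025: We published the paper AsyncFlow.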

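🧩 Components

Control Plane: Panoramic Data Management

In the control plane, TransferQueueController tracks the production status and consumption status of each training sample as metadata. Once all required data fields are ready (i.e., written to the StorageManager), the sample becomes available for consumption by downstream tasks.

We also track the consumption history of each computational task (e.g., generate_sequences, compute_log_prob). Therefore, even when different computational tasks require the same data field, they can consume the data independently without interfering with each other.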
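The bookkeeping described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not the controller's implementation: the `StatusTracker` class, its method names, and the `compute_reward` task name are all hypothetical, and the real controller stores this metadata differently.

```python
from collections import defaultdict


class StatusTracker:
    """Toy metadata tracker (hypothetical): production status per field, consumption per task."""

    def __init__(self) -> None:
        # global_index -> set of field names already written to storage
        self.produced: dict[int, set[str]] = defaultdict(set)
        # task_name -> set of global indices that task has already consumed
        self.consumed: dict[str, set[int]] = defaultdict(set)

    def mark_produced(self, index: int, fields: list[str]) -> None:
        self.produced[index].update(fields)

    def ready_indexes(self, task_name: str, required_fields: list[str]) -> list[int]:
        """Indices whose required fields all exist and that this task has not consumed."""
        required = set(required_fields)
        already = self.consumed[task_name]
        return sorted(
            idx
            for idx, fields in self.produced.items()
            if required <= fields and idx not in already
        )

    def mark_consumed(self, task_name: str, indexes: list[int]) -> None:
        self.consumed[task_name].update(indexes)


tracker = StatusTracker()
tracker.mark_produced(0, ["prompt", "response"])
tracker.mark_produced(1, ["prompt"])  # the response for sample 1 is not written yet

# Sample 1 is not ready for a task that needs both fields.
ready_for_log_prob = tracker.ready_indexes("compute_log_prob", ["prompt", "response"])
tracker.mark_consumed("compute_log_prob", ready_for_log_prob)

# A different task that needs the same field still sees sample 0.
ready_for_reward = tracker.ready_indexes("compute_reward", ["response"])
after_consume = tracker.ready_indexes("compute_log_prob", ["prompt", "response"])
print(ready_for_log_prob, ready_for_reward, after_consume)
```

Keeping one consumed set per task is what lets two tasks read the same field without stepping on each other in this toy model.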

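To make data retrieval more customizable, we provide a Sampler class that lets users define their own data retrieval and consumption logic. See the Customize section for details.

The control plane has experimental support for load balancing. This design lets us offload part of the data management work from a single controller. See #PR70 for details.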

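Data Plane: Distributed Data Storage

In the data plane, we adopt a pluggable design so that TransferQueue can integrate different storage backends according to user requirements.

Specifically, we provide a StorageManager abstract class that defines the following core APIs: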

async def put_data(self, data: TensorDict, metadata: BatchMeta) -> None
async def get_data(self, metadata: BatchMeta) -> TensorDict
async def clear_data(self, metadata: BatchMeta) -> None

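This class encapsulates the core interaction logic within the TransferQueue system. You only need to write a simple subclass to integrate your own storage backend. See the Customize section for details.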
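To give a feel for what such a subclass involves, here is a minimal sketch that mirrors the three coroutines listed above. It is deliberately not written against the real classes: `ToyBatchMeta` and `ToyStorageManager` are hypothetical stand-ins for BatchMeta and StorageManager, plain dicts of lists replace TensorDict, and the attribute names `global_indexes` and `fields` are assumptions made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyBatchMeta:
    """Hypothetical stand-in for BatchMeta: which rows and columns a call touches."""

    global_indexes: list[int]
    fields: list[str]


class ToyStorageManager(ABC):
    """Hypothetical stand-in mirroring the three core coroutines listed above."""

    @abstractmethod
    async def put_data(self, data: dict[str, list[Any]], metadata: ToyBatchMeta) -> None: ...

    @abstractmethod
    async def get_data(self, metadata: ToyBatchMeta) -> dict[str, list[Any]]: ...

    @abstractmethod
    async def clear_data(self, metadata: ToyBatchMeta) -> None: ...


class InMemoryStorageManager(ToyStorageManager):
    """Keeps every (global_index, field) cell in a local dict."""

    def __init__(self) -> None:
        self._cells: dict[tuple[int, str], Any] = {}

    async def put_data(self, data, metadata):
        for field in metadata.fields:
            for idx, value in zip(metadata.global_indexes, data[field]):
                self._cells[(idx, field)] = value

    async def get_data(self, metadata):
        return {
            field: [self._cells[(idx, field)] for idx in metadata.global_indexes]
            for field in metadata.fields
        }

    async def clear_data(self, metadata):
        for idx in metadata.global_indexes:
            for field in metadata.fields:
                self._cells.pop((idx, field), None)


async def demo() -> tuple[dict, int]:
    manager = InMemoryStorageManager()
    meta = ToyBatchMeta(global_indexes=[3, 7], fields=["input_ids"])
    await manager.put_data({"input_ids": [[1, 2], [3, 4]]}, meta)
    fetched = await manager.get_data(meta)
    await manager.clear_data(meta)
    return fetched, len(manager._cells)


fetched, remaining = asyncio.run(demo())
print(fetched, remaining)
```

A real backend would replace the local dict with calls into the storage system; the metadata-driven shape of the three methods is the part this sketch is meant to convey.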

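We currently support the following storage backends:

SimpleStorage: A basic CPU memory storage with minimal data format constraints that is easy to use.
Yuanrong (usage guide, beta, #PR107, #PR96): An Ascend-native data system that provides hierarchical storage interfaces including HBM/DRAM/SSD.
MooncakeStore (beta, #PR162): A high-performance, KV-based hierarchical storage that supports RDMA transfers between GPU and DRAM.
RayRDT (alpha, #PR167): A new Ray feature that lets Ray store and pass objects directly between Ray actors.

Among them, SimpleStorageUnit serves as our default storage backend and is coordinated by the AsyncSimpleStorageManager class. Each storage unit can be deployed on a separate node, enabling distributed data management.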

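SimpleStorageUnit uses a 2D data structure, as follows:

Each row corresponds to a training sample, which is assigned a unique index within the corresponding global batch.
Each column represents an input/output data field of the computational tasks.

This data structure is motivated by the computational characteristics of the post-training process, in which each training sample is produced in a relay fashion across the task pipeline. It provides precise addressing, which allows fine-grained, concurrent reads and writes in a streaming manner.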
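The row/column layout can be sketched in a few lines. The `SampleTable` class below is a hypothetical illustration of the idea (rows as global sample indices, columns as fields), not the SimpleStorageUnit implementation; in particular, using `None` to mean "not produced yet" is an assumption of this sketch.

```python
from typing import Any


class SampleTable:
    """Toy row/column store: rows are global sample indices, columns are data fields."""

    def __init__(self, num_rows: int) -> None:
        self.num_rows = num_rows
        # field name -> one slot per row; None means "not produced yet"
        self.columns: dict[str, list[Any]] = {}

    def write(self, field: str, rows: list[int], values: list[Any]) -> None:
        column = self.columns.setdefault(field, [None] * self.num_rows)
        for row, value in zip(rows, values):
            column[row] = value

    def read(self, fields: list[str], rows: list[int]) -> dict[str, list[Any]]:
        return {f: [self.columns[f][r] for r in rows] for f in fields}

    def complete_rows(self, fields: list[str]) -> list[int]:
        """Rows for which every requested field has been written."""
        return [
            r
            for r in range(self.num_rows)
            if all(f in self.columns and self.columns[f][r] is not None for f in fields)
        ]


table = SampleTable(num_rows=4)
# An upstream stage fills the prompt column for all rows.
table.write("prompt", [0, 1, 2, 3], ["p0", "p1", "p2", "p3"])
# A later stage has streamed in responses for only some rows so far.
table.write("response", [2, 0], ["r2", "r0"])

ready = table.complete_rows(["prompt", "response"])
batch = table.read(["prompt", "response"], ready)
print(ready, batch)
```

Because each cell is addressed by (row, field), a downstream task can start on rows 0 and 2 while rows 1 and 3 are still being generated.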

User Interface: High-Level and Low-Level APIs

Level	Type	Style	Fine-Grained Access	Streaming	Sampler	Multi-Backend Support
High	KV Interface (PR#28)	Put/Get/List/Clear	✓	○	✗	✓
High	StreamingDataLoader (PR#23)	PyTorch DataLoader	✓	✓	✓	✓
Low	TransferQueueClient	Metadata-based	✓	✓	✓	✓

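Key-Value Based API

To simplify the use of TransferQueue, we provide a Redis-style high-level API that exposes most of its advanced features (PR#28).

Methods

(async_)kv_put: Insert or update a multi-column sample by key, with optional metadata tags.
(async_)kv_batch_put: Write multiple key-value pairs efficiently in a batch.
(async_)kv_batch_get: Retrieve samples by key, with column selection by field.
(async_)kv_list: List the keys and tags (metadata) in a partition.
(async_)kv_clear: Remove key-value pairs from storage.

Key Features

Redis-style semantics: A familiar KV interface (Put/Get/List) with essentially no learning curve.
Fine-grained access: Update or retrieve specific fields (columns) within a key (row) without operating on the whole row.
Partition isolation: Logical isolation of storage namespaces.
Metadata tags: Lightweight metadata for status tracking.
Pluggable backends: Supports multiple backends.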
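The semantics listed above (field-level updates, partition isolation, metadata tags) can be imitated with a small in-memory class. This is a sketch only: `ToyKVStore` is hypothetical, and although its method names echo the ones above, the parameter names and return shapes are assumptions and should not be read as the library's signatures.

```python
from typing import Any


class ToyKVStore:
    """In-memory imitation of the KV semantics; NOT the TransferQueue API."""

    def __init__(self) -> None:
        # partition_id -> key -> {"fields": {...}, "tag": {...}}
        self._partitions: dict[str, dict[str, dict[str, dict[str, Any]]]] = {}

    def kv_put(self, key, fields, partition_id, tag=None) -> None:
        rows = self._partitions.setdefault(partition_id, {})
        row = rows.setdefault(key, {"fields": {}, "tag": {}})
        row["fields"].update(fields)  # only the given columns change
        row["tag"].update(tag or {})

    def kv_batch_get(self, keys, partition_id, fields=None) -> list[dict[str, Any]]:
        rows = self._partitions[partition_id]
        result = []
        for key in keys:
            stored = rows[key]["fields"]
            selected = stored if fields is None else {f: stored[f] for f in fields}
            result.append(dict(selected))
        return result

    def kv_list(self, partition_id) -> dict[str, dict[str, Any]]:
        rows = self._partitions.get(partition_id, {})
        return {key: dict(row["tag"]) for key, row in rows.items()}

    def kv_clear(self, keys, partition_id) -> None:
        for key in keys:
            self._partitions.get(partition_id, {}).pop(key, None)


store = ToyKVStore()
store.kv_put("sample-0", {"prompt": "2+2=?"}, partition_id="train", tag={"status": "prompted"})
# A later stage adds one column to the same row without rewriting the prompt.
store.kv_put("sample-0", {"response": "4"}, partition_id="train", tag={"status": "answered"})
# The same key in another partition is a separate row.
store.kv_put("sample-0", {"prompt": "val prompt"}, partition_id="val")

only_response = store.kv_batch_get(["sample-0"], partition_id="train", fields=["response"])
whole_row = store.kv_batch_get(["sample-0"], partition_id="train")
train_listing = store.kv_list("train")
store.kv_clear(["sample-0"], partition_id="train")
after_clear = store.kv_list("train")
val_listing = store.kv_list("val")
print(only_response, whole_row, train_listing, after_clear, val_listing)
```

In this toy model, clearing a key in one partition leaves the same key in another partition untouched, which is the isolation property the feature list describes.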

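For detailed usage examples, see tutorials/basic.ipynb and tutorials/02_kv_interface.py.

StreamingDataLoader API

This API is designed as a drop-in replacement for the standard PyTorch DataLoader. It lets each process (rank) consume data automatically, without intervention from a single controller.

In this scenario, TransferQueueController acts as an auxiliary controller for data dispatching and works with a user-defined Sampler class to organize the dataflow. It encapsulates the complex scheduling and data transfer logic required by various parallelism strategies, integrating TransferQueue seamlessly into existing training workflows and simplifying the development of distributed frameworks.

See the roadmap and tutorials/06_streaming_dataloader.py for more details.

Low-Level Native API

The native interface of TransferQueue is implemented in TransferQueueClient. It offers maximum flexibility through native, atomic operations.

Developers can use TransferQueueClient directly to implement advanced features that require fine-grained control and fully streamed data scheduling, as illustrated in the tutorials.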

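🔥 Showcases

Homogeneous Deployment Example

verl

The primary motivation for integrating TransferQueue into verl is to alleviate the data transfer bottleneck of the single controller RayPPOTrainer. Currently, all DataProto objects must be routed through RayPPOTrainer, which creates a single-point bottleneck for the whole post-training system.

The official verl integration is available at verl/pull/5401, and the design document is [RFC] PPOTrainer with TransferQueue Integration. You can also refer to our example code, which mimics the verl usage scenario at a high level.

Heterogeneous Deployment Example

We have experimentally implemented a standardized, fully streamed distributed workflow with TransferQueue.

By leveraging the RankAwareSampler and StreamingDataLoader interfaces, we built a streamlined producer-consumer pipeline at micro-batch granularity. This design removes the need to manually work out the data dispatching logic under different parallelism strategies, a typical source of complexity in the single-controller paradigm, and thereby greatly simplifies framework design.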

See our roadmap and tutorials/06_streaming_dataloader.py for more details.

🚀 Quick Start

Install the Python Package

pip install TransferQueue

Install from Source

Clone the source code from the GitHub repository

git clone https://github.com/Ascend/TransferQueue/
cd TransferQueue

Install from source
```
pip install .
```

Build a Wheel Package from Source

Clone the source code from the GitHub repository

git clone https://github.com/Ascend/TransferQueue/
cd TransferQueue

Install dependencies
```
pip install build
```

Build and install

python -m build --wheel
pip install dist/*.whl

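📊 Performance

Simple case: regular tensors

Complex case: regular tensors + NestedTensor + non-tensor data

Note: The openYuanrong benchmark uses only a single NPU, so it does not reflect multi-NPU scalability. In addition, the hardware environment used for the openYuanrong tests differs from that of the other backends.

For detailed performance benchmarks, see the full benchmark report.

Stress Test

Beyond throughput, we also validated stability under high concurrency. We provide a stress test report covering a scenario in which 8192 concurrent clients write 2 TB of data to TransferQueue across 4 nodes. The system remained stable, with no crashes or data loss.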

🛠️ Customize TransferQueue

Define Your Own Data Retrieval Logic

We provide a BaseSampler abstract class that defines the following interface:

@abstractmethod
def sample(
    self,
    ready_indexes: list[int],
    batch_size: int,
    *args: Any,
    **kwargs: Any,
) -> tuple[list[int], list[int]]:
    """Sample a batch of indices from the ready indices.

    Args:
        ready_indexes: List of global indices for which all required fields of the
        corresponding samples have been produced, and the samples are not labeled as
        consumed in the corresponding task.
        batch_size: Number of samples to select
        *args: Additional positional arguments for specific sampler implementations
        **kwargs: Additional keyword arguments for specific sampler implementations

    Returns:
        List of sampled global indices of length batch_size
        List of global indices of length batch_size that should be labeled as consumed
        (will never be retrieved in the future)

    Raises:
        ValueError: If batch_size is invalid or ready_indexes is insufficient
    """
    raise NotImplementedError("Subclasses must implement sample")

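In this design, the two return values separate data retrieval from data consumption, which makes it easy to control sampling with replacement. We have implemented two reference designs: SequentialSampler and GRPOGroupNSampler.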
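A minimal sketch of how the two return values can express sampling with and without replacement is shown below. Everything here is illustrative: `ToyBaseSampler` is a hypothetical stand-in that copies only the `sample()` contract, `FirstNSampler` and `drain` are invented for this example, and returning an empty consumed list is an assumption of the sketch (the docstring above describes that list as having length batch_size, so check the reference samplers before relying on it).

```python
from abc import ABC, abstractmethod
from typing import Any


class ToyBaseSampler(ABC):
    """Hypothetical stand-in with the same sample() contract as the interface above."""

    @abstractmethod
    def sample(
        self, ready_indexes: list[int], batch_size: int, *args: Any, **kwargs: Any
    ) -> tuple[list[int], list[int]]:
        raise NotImplementedError


class FirstNSampler(ToyBaseSampler):
    """Takes the first batch_size ready indices; optionally leaves them reusable."""

    def __init__(self, with_replacement: bool = False) -> None:
        self.with_replacement = with_replacement

    def sample(self, ready_indexes, batch_size, *args, **kwargs):
        if batch_size <= 0 or len(ready_indexes) < batch_size:
            raise ValueError("batch_size is invalid or ready_indexes is insufficient")
        sampled = ready_indexes[:batch_size]
        # Assumption of this sketch: an empty second list marks nothing as consumed,
        # so a later call can return the same samples again.
        consumed = [] if self.with_replacement else list(sampled)
        return sampled, consumed


def drain(sampler: ToyBaseSampler, ready: list[int], batch_size: int, rounds: int) -> list[list[int]]:
    """Toy driver: calls sample() repeatedly and drops consumed indices from the ready pool."""
    batches = []
    for _ in range(rounds):
        sampled, consumed = sampler.sample(ready, batch_size)
        batches.append(sampled)
        done = set(consumed)
        ready = [i for i in ready if i not in done]
    return batches


no_replace = drain(FirstNSampler(), [0, 1, 2, 3], batch_size=2, rounds=2)
replace = drain(FirstNSampler(with_replacement=True), [0, 1, 2, 3], batch_size=2, rounds=2)
print(no_replace)  # [[0, 1], [2, 3]]
print(replace)     # [[0, 1], [0, 1]]
```

The toy driver stands in for the controller: whatever the sampler reports as consumed disappears from the next call's ready list, and nothing else does.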

The Sampler class or instance should be passed to TransferQueueController during initialization. On each get_meta call, you can pass dynamic sampling parameters to the Sampler.

from transfer_queue import TransferQueueController, TransferQueueClient, GRPOGroupNSampler, process_zmq_server_info

# Option 1: Pass the sampler class to the TransferQueueController
controller = TransferQueueController.remote(GRPOGroupNSampler)

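# Option 2: Pass the sampler instance to the TransferQueueController (if you need custom configuration)
# YourOwnSampler and config are placeholders for your own sampler subclass and its settings.
your_own_sampler = YourOwnSampler(config)
controller = TransferQueueController.remote(your_own_sampler)

# Use the sampler
# client is assumed to be a TransferQueueClient that is already connected to the controller.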
batch_meta = client.get_meta(
    data_fields=["input_ids", "attention_mask"],
    batch_size=8,
    partition_id="train_0",
    task_name="generate_sequences",
)

For more details, see tutorials/05_custom_sampler.py.

How to Integrate a New Storage Backend

The data plane is organized as follows:

  transfer_queue/
  ├── storage/
  │   ├── __init__.py
  │   ├── simple_backend.py             # Default distributed storage backend (SimpleStorageUnit) by TQ
  │   ├── managers/                     # Managers are upper level interfaces that encapsulate the interaction logic with TQ system.
  │   │   ├── __init__.py
  │   │   ├── base.py                   # StorageManager, KVStorageManager, StorageManagerFactory
  │   │   ├── simple_storage_manager.py # AsyncSimpleStorageManager
  │   │   ├── yuanrong_manager.py       # YuanrongStorageManager
  │   │   └── mooncake_manager.py       # MooncakeStorageManager
  │   └── clients/                      # Clients are lower level interfaces that directly manipulate the target storage backend.
  │       ├── __init__.py
  │       ├── base.py                   # StorageKVClient, StorageClientFactory
  │       ├── yuanrong_client.py        # YuanrongStorageClient
  │       ├── mooncake_client.py        # MooncakeStorageClient
  │       └── ray_storage_client.py     # RayStorageClient

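To integrate TransferQueue with a custom storage backend, start by implementing a subclass that inherits from StorageManager. This subclass acts as an adapter between the TransferQueue system and the target storage backend. For KV-based storage backends, you can simply inherit from KVStorageManager, which serves as a general manager for all KV-based backends.

Distributed storage backends usually come with their own native client that serves as the interface to the storage system. In that case, you can write a low-level adapter for this client, following the examples provided in the storage/clients directory.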
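As a rough picture of such a low-level adapter, the sketch below wraps a fake native client behind batched put/get/clear calls. All names are hypothetical: `FakeNativeClient` stands in for a backend's SDK, `ToyClientAdapter` is not the StorageKVClient base class, and both the `index@field` key scheme and the use of pickle for serialization are assumptions chosen to keep the example short.

```python
import pickle
from typing import Any


class FakeNativeClient:
    """Stands in for a backend's own SDK; a real one would talk to remote servers."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def set(self, key: str, blob: bytes) -> None:
        self._blobs[key] = blob

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def delete(self, key: str) -> None:
        self._blobs.pop(key, None)


class ToyClientAdapter:
    """Hypothetical low-level adapter: batched put/get/clear over a native client."""

    def __init__(self, native: FakeNativeClient) -> None:
        self._native = native

    @staticmethod
    def make_key(global_index: int, field: str) -> str:
        # One key per (row, column) cell keeps field-level reads and writes independent.
        return f"{global_index}@{field}"

    def put(self, keys: list[str], values: list[Any]) -> None:
        if len(keys) != len(values):
            raise ValueError("keys and values must have the same length")
        for key, value in zip(keys, values):
            self._native.set(key, pickle.dumps(value))

    def get(self, keys: list[str]) -> list[Any]:
        return [pickle.loads(self._native.get(key)) for key in keys]

    def clear(self, keys: list[str]) -> None:
        for key in keys:
            self._native.delete(key)


native = FakeNativeClient()
adapter = ToyClientAdapter(native)
keys = [ToyClientAdapter.make_key(i, "input_ids") for i in (0, 1)]
adapter.put(keys, [[101, 102], [103]])
round_trip = adapter.get(keys)
adapter.clear(keys[:1])
remaining_keys = sorted(native._blobs)
print(keys, round_trip, remaining_keys)
```

The adapter only translates between a list-oriented interface and the native client's per-key calls; anything about batching, zero-copy transfer, or tensor layout is backend-specific and left out here.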

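Factory classes are provided for both StorageManager and StorageClient to ease integration. Documenting the required parameters in the factory class helps improve the overall user experience.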

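✏️ Contribution Guide

Contributions are warmly welcome!

New ideas, feature suggestions, and user experience feedback are all encouraged. Feel free to submit issues or PRs, and we will respond as soon as possible.

We recommend using pre-commit for better code formatting.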

# install pre-commit
pip install pre-commit

# run the following command in your repo folder, then fix the check before committing your code
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always

📑 Citation

If you find this repository useful, please kindly cite our paper:

@article{han2025asyncflow,
  title={AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training},
  author={Han, Zhenyu and You, Ansheng and Wang, Haibo and Luo, Kui and Yang, Guang and Shi, Wenqi and Chen, Menglong and Zhang, Sicheng and Lan, Zeshun and Deng, Chunshi and others},
  journal={arXiv preprint arXiv:2507.01663},
  year={2025}
}

About

An asynchronous, high-performance streaming data engine

Apache-2.0 · Python · 132 commits · ray · pytorch
