Inference Service
inference_service is the core AI execution package for IB-Robot. It provides a standardized framework for running end-to-end Machine Learning policies (like ACT, pi0, etc.) on physical robots with strict temporal alignment and zero-copy latency optimizations.
Architecture: Composition over Inheritance
The inference pipeline is decoupled into three pure-Python core components (inference_service.core):
- TensorPreprocessor: Converts raw ROS 2 sensor data (images, joint states) into normalized PyTorch Tensors.
- PureInferenceEngine: A completely stateless, ROS-agnostic GPU execution engine.
- TensorPostprocessor: Denormalizes output action tensors back into physical control commands.
By separating the core math from the ROS 2 transport layer, this package supports two distinct deployment modes, toggleable via a single YAML parameter.
🚀 Execution Modes
Mode A: Monolithic (Single-Machine Zero-Copy)
Best for: Robots equipped with high-performance onboard GPUs (e.g., RTX 4060).
In this mode, lerobot_policy_node.py instantiates an InferenceCoordinator that chains the Preprocessor, Engine, and Postprocessor together.
- Data Flow: Sensor data stays entirely within the RAM/VRAM of the single process. Tensors are passed by reference.
- Performance: Absolute lowest latency. Zero serialization/deserialization overhead.
- Config:
execution_mode: "monolithic"
Mode B: Device-Edge-Cloud Synergy (Distributed)
Best for: Lightweight robots (Device) running on low-power CPUs (e.g., Raspberry Pi) paired with a high-end computation node (Edge) or tower server (Cloud) over a LAN.
To preserve compatibility with the pull-based action_dispatch system without clogging the network with 30fps video streams, the Device node acts as an Asynchronous Proxy.
- Device Node (
lerobot_policy_node.py): Receives the action goal, reads the cameras on-demand, runs the Preprocessor on CPU, and publishes the lightweight Tensor batch to/preprocessed/batch. The action callback is then suspended using an asynchronousthreading.Event. - Edge/Cloud Node (
pure_inference_node.py): Subscribes to the batch, crunches the numbers on the GPU usingPureInferenceEngine, and returns the raw action to/inference/action. - Device Node: Wakes up, runs the Postprocessor, and completes the Action sequence.
- Performance: Achieves "Compute Offloading" perfectly. The Device only sends the exact frames needed for inference (e.g., 20Hz), saving massive network bandwidth.
- Config:
execution_mode: "distributed"
Device Machine (Robot / Sim) GPU Machine (Edge/Cloud)
┌──────────────────────────────┐ ┌──────────────────────────┐
│ action_dispatcher_node │ │ │
│ ↓ │ │ pure_inference_node │
│ lerobot_policy_node (Proxy) │ │ ├─ Subscribe │
│ ├─ TensorPreprocessor (CPU) │ │ │ /preprocessed/batch │
│ ├─ threading.Event │ │ ├─ PureInferenceEngine │
│ └─ TensorPostprocessor(CPU) │ │ │ (GPU) │
│ ↓ Pub ↑ Sub │ │ └─ Publish │
│ /preprocessed /inference │ │ /inference/action │
│ /batch /action │ │ │
└──────────┬──────────┬────────┘ └───────┬──────────┬───────┘
│ │ LAN (same ROS_DOMAIN_ID) │ │
└──────────┴──────────────────────────┴──────────┘
⚙️ Configuration & Usage
The execution mode is controlled seamlessly via your robot_config YAML files. You do not need to change launch files to switch modes on the device.
# src/robot_config/config/robots/your_robot.yaml
control_modes:
model_inference:
inference:
enabled: true
execution_mode: "distributed" # Or "monolithic"
model: so101_act
Launching
Scenario 1: Cross-Machine Distributed Deployment (Recommended for Production)
Both machines must share the same ROS_DOMAIN_ID and be on the same LAN.
Step 1 — On the Robot (Device):
The Device launches only the Edge proxy node (pre/post-processing), without loading GPU models:
export ROS_DOMAIN_ID=<your_domain_id>
ros2 launch robot_config robot.launch.py \
robot_config:=so101_single_arm \
control_mode:=model_inference \
execution_mode:=distributed \
use_sim:=true # For simulation; omit for real hardware
Step 2 — On the GPU Server (Edge/Cloud):
export ROS_DOMAIN_ID=<same_domain_id_as_device>
ros2 launch inference_service cloud_inference.launch.py \
policy_path:=/path/to/models/pretrained_model \
device:=cuda
For models exported through the ATC/SVP toolchain, inference_service can own
the OM wrappers migrated from the original LeRobot patches:
# Generic Ascend ACL .om backend
ros2 launch inference_service cloud_inference.launch.py \
policy_path:=/path/to/pretrained_model \
device:=ascend_om
# SD3403 worker binary protocol backend
ros2 launch inference_service cloud_inference.launch.py \
policy_path:=/path/to/pretrained_model \
device:=ascend_om_3403
device:=ascend_om resolves the .om model from om_model_path in
policy_path/config.json, ASCEND_OM_MODEL_PATH/OM_MODEL_PATH, or common
files under policy_path. device:=ascend_om_3403 also resolves the worker
executable from cpp_executable, SVP_WORKER_EXECUTABLE, or common out/main
layouts. Preprocessing, postprocessing, ROS topics, and distributed transport
remain the existing inference_service pipeline.
For RK3588 / OpenHarmony boards running RKNN Lite, switch the cloud node to:
export ROS_DOMAIN_ID=<same_domain_id_as_device>
ros2 launch inference_service cloud_inference.launch.py \
policy_path:=/path/to/models/pretrained_model \
device:=rknn
device:=rknn still uses the LeRobot metadata under policy_path/config.json
for preprocessing and postprocessing, while expecting the actual RKNN artifact to
live inside policy_path, using model.rknn as the default filename.
Scenario 2: Single-Machine Debug (Development)
Run both Edge + Cloud nodes on one machine by adding cloud_local:=true:
ros2 launch robot_config robot.launch.py \
robot_config:=so101_single_arm \
control_mode:=model_inference \
execution_mode:=distributed \
use_sim:=true \
cloud_local:=true
Verifying Distributed Mode
# 1. Confirm both inference nodes are online
ros2 node list | grep -E 'act_inference|pure_inference'
# Expected:
# /act_inference_node ← Edge (pre/post-processing)
# /pure_inference ← Cloud (GPU inference)
# 2. Confirm distributed topics exist
ros2 topic list | grep -E 'preprocessed|inference/action'
# Expected:
# /preprocessed/batch ← Edge → Cloud
# /inference/action ← Cloud → Edge
# 3. Monitor inference frequency
ros2 topic hz /inference/action
Logging Reference
After launch, each node prints key lifecycle messages for quick status diagnosis:
| Node | Example Log | Meaning |
|---|---|---|
pure_inference |
Waiting for preprocessed batches from edge node... |
Cloud node ready, waiting for Edge data |
pure_inference |
✓ First inference completed: latency=XXms |
First inference succeeded, end-to-end link confirmed |
pure_inference |
[stats] count=XX, avg=XXms, last=XXms |
Performance stats every 5 seconds |
act_inference_node |
✓ First inference complete (distributed): total=XXms |
Edge node completed full inference round-trip |
action_dispatcher |
✓ First inference received: chunk=XX, latency=XXms |
Dispatcher received first executable actions |
action_dispatcher |
[stats] inferences=XX, avg_latency=XXms, queue=XX, hold=XX |
Dispatch stats every 5s; hold = times queue exhausted and last frame held |
🧪 Testing
Because the core components are isolated from ROS, they can be validated entirely offline:
pytest src/inference_service/tests/