TensorCast and ServingCast Quick Start
1. Overview
msModeling provides single-model performance simulation and service-level inference simulation. This guide is for first-time TensorCast and ServingCast users. It walks through environment checks, LLM text-generation simulation, throughput optimization, and end-to-end service simulation, helping you understand the main inputs, outputs, and usage scenarios.
1.1 Before You Start
Experience Map (core operations take about 10 minutes)
Recommended order: Step 1 is the environment baseline. Step 2 runs TensorCast single-model simulation. Steps 3 and 4 cover ServingCast throughput optimization and service simulation, and can be selected as needed.
| Step | Stage | Core Module | Reference Operation Time | Suggested Concept Study |
|---|---|---|---|---|
| 1 | Environment setup | msModeling | 2 minutes | 5 minutes |
| 2 | Single-model simulation | TensorCast | 1 minute | 10 minutes |
| 3 | Throughput optimization | ServingCast / Throughput Optimizer | 2 minutes | 15 minutes |
| 4 | Service simulation | ServingCast | 2 minutes | 15 minutes |
1.2 Environment Preparation
👉 Important: Complete the environment installation and configuration in the msModeling Install Guide first.
Caution
This guide assumes commands are run from the msModeling repository root. If you run them from another directory, set PYTHONPATH first. Otherwise, errors such as No module named cli or No module named tensor_cast may occur.
2. Steps
Note
The commands below can be copied and run directly. Use TEST_DEVICE to complete the workflow first, then replace it with the target hardware device and workload model.
2.1 Environment: Confirm Runtime Setup
Before starting, complete the environment setup in the msModeling Install Guide, including repository cloning, virtual environment creation, dependency installation, and PYTHONPATH configuration.
The following commands assume you are in the msModeling repository root. If not, set:
export PYTHONPATH=/path/to/msmodeling:$PYTHONPATH
TensorCast reads model configuration files from Hugging Face. If the environment cannot access Hugging Face directly, set a mirror:
export HF_ENDPOINT="https://hf-mirror.com"
Use the following commands to confirm the command-line entry points are available:
python -m cli.inference.text_generate --help
python -m serving_cast.main --help
If the commands do not print help information, check that the virtual environment is activated, dependencies are installed, and PYTHONPATH points to the msModeling repository root.
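If the help commands fail with an import error, a quick way to narrow down the cause is to ask Python which of the top-level packages it can locate. The following is a minimal sketch, assuming the package names cli, tensor_cast, and serving_cast mentioned in this guide; the helper name find_missing_packages is illustrative and not part of msModeling.

```python
import importlib.util

# Top-level packages referenced by the commands and error messages in this guide.
REQUIRED_PACKAGES = ("cli", "tensor_cast", "serving_cast")


def find_missing_packages(names=REQUIRED_PACKAGES):
    """Return the package names that Python cannot locate on the current sys.path."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module or package is not found.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_packages()
    if missing:
        print("Not importable:", ", ".join(missing))
        print("Run from the repository root or set PYTHONPATH to it.")
    else:
        print("All required packages are importable.")
```

Run the script with the same interpreter and working directory that you use for the simulation commands, since both affect the module search path.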
2.2 Single-Model Simulation: Run TensorCast Text Generation
TensorCast performs performance modeling for PyTorch programs. It does not execute the model on a real accelerator. Instead, it intercepts the computation graph and estimates operator latency, memory usage, and overall inference performance based on the target device profile.
Note
TensorCast prints operator-level performance summaries, total execution time, TPS/Device, and memory usage by default. If --chrome-trace is specified, it can also generate a Chrome Trace file for timeline analysis.
2.2.1 Run LLM Text-Generation Simulation
python -m cli.inference.text_generate Qwen/Qwen3-32B \
--num-queries 2 \
--query-length 3500 \
--device TEST_DEVICE
2.2.2 Check Simulation Results
If the command succeeds, the terminal prints output similar to:
Model compilation and execution time: 0.192 s
---------------------------------------------- -------------- ------------ ----------
Name analytic total analytic avg # of Calls
---------------------------------------------- -------------- ------------ ----------
tensor_cast.static_quant_linear.default 884.004ms 1.973ms 448
tensor_cast.attention.default 259.855ms 4.060ms 64
...
Total time for analytic: 1.744s
[analytic] TPS/Device: 4013 token/s
Total device memory: 64.000 GB
Key metrics:
- analytic total: Estimated total operator latency.
- analytic avg: Estimated average latency per operator call.
- # of Calls: Number of operator calls.
- TPS/Device: Tokens per second per device.
- Total device memory: Estimated memory usage, including weights, KV cache, and activations.
Success criteria:
- The terminal prints an operator-level performance table.
- The output includes Total time for analytic or [analytic] TPS/Device.
- The output includes memory estimation, such as Total device memory.
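The success criteria above can also be checked programmatically. The sketch below parses the three summary lines from the sample output shown earlier and cross-checks TPS/Device against the command-line inputs. It assumes the line formats match the sample exactly and that TPS/Device equals the number of input tokens divided by total time on a single device; that relationship is inferred from the sample numbers, not taken from the TensorCast source. The function name parse_summary is illustrative.

```python
import re

# Summary lines copied from the sample output in this guide.
SAMPLE_OUTPUT = """\
Total time for analytic: 1.744s
[analytic] TPS/Device: 4013 token/s
Total device memory: 64.000 GB
"""


def parse_summary(text):
    """Extract the summary metrics; a metric is None when its line is absent."""
    patterns = {
        "total_time_s": r"Total time for analytic:\s*([\d.]+)\s*s",
        "tps_per_device": r"\[analytic\] TPS/Device:\s*([\d.]+)\s*token/s",
        "device_memory_gb": r"Total device memory:\s*([\d.]+)\s*GB",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        result[key] = float(match.group(1)) if match else None
    return result


summary = parse_summary(SAMPLE_OUTPUT)

# Assumed relationship: --num-queries 2 and --query-length 3500 give 7000 tokens.
tokens = 2 * 3500
estimated_tps = tokens / summary["total_time_s"]
print(summary)
print(f"tokens / total time = {estimated_tps:.1f} token/s")
```

With the sample values, the quotient lands within one token per second of the reported TPS/Device, which suggests the two figures are consistent with each other.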
2.2.3 Generate Chrome Trace (Optional)
To inspect a more fine-grained timeline, add --chrome-trace:
python -m cli.inference.text_generate Qwen/Qwen3-32B \
--num-queries 2 \
--query-length 3500 \
--device TEST_DEVICE \
--chrome-trace ./tensorcast_trace.json
After generation, open the trace file with chrome://tracing or MindStudio Insight.
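For a quick text summary without a trace viewer, the file can also be read as JSON. The sketch below follows the general Chrome Trace Event format, in which a trace is either a list of events or an object with a traceEvents list, and complete events carry ph set to "X" with a dur field in microseconds. Whether TensorCast emits complete events with these fields is an assumption; the helper names are illustrative.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Load a Chrome Trace JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def total_duration_by_name(trace):
    """Sum the duration of complete ("X") events per event name, largest first."""
    # The format allows either a bare event list or {"traceEvents": [...]}.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Usage, assuming the trace path from the command above:
# for name, duration_us in total_duration_by_name(load_trace("./tensorcast_trace.json"))[:10]:
#     print(f"{duration_us:12.1f} us  {name}")
```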
2.3 Throughput Optimization: Run the ServingCast Throughput Optimizer
The ServingCast throughput optimizer searches for the best parallel strategy and batch configuration under SLO constraints such as TTFT and TPOT. It helps estimate the maximum serving throughput of a target model on target hardware.
Note
PD colocated means Prefill and Decode run in the same instance. It is suitable for quickly evaluating overall service throughput. To evaluate Prefill and Decode separately, see the Throughput Optimizer Guide.
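To make the idea of searching under SLO constraints concrete, the sketch below filters a set of candidate configurations by TTFT and TPOT limits and keeps the one with the highest token throughput. It illustrates the selection principle only and is not the optimizer's actual algorithm. The Candidate class, the best_under_slo function, and every number in CANDIDATES are invented for illustration, with latencies and limits in the same arbitrary unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical configuration with its estimated metrics."""
    tp: int
    dp: int
    batch_size: int
    ttft: float
    tpot: float
    token_throughput: float


def best_under_slo(candidates, tpot_limit, ttft_limit=None):
    """Return the feasible candidate with the highest throughput, or None."""
    feasible = [
        c for c in candidates
        if c.tpot <= tpot_limit and (ttft_limit is None or c.ttft <= ttft_limit)
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.token_throughput)


# Made-up values, not tool output. Larger batches raise throughput and latency.
CANDIDATES = [
    Candidate(tp=8, dp=1, batch_size=16, ttft=900.0, tpot=30.0, token_throughput=500.0),
    Candidate(tp=4, dp=2, batch_size=32, ttft=1400.0, tpot=45.0, token_throughput=700.0),
    Candidate(tp=2, dp=4, batch_size=64, ttft=2600.0, tpot=80.0, token_throughput=800.0),
]

best = best_under_slo(CANDIDATES, tpot_limit=50.0)
print(best)
```

In this toy data the highest-throughput candidate violates the TPOT limit, so the second-highest is selected. This is why a tighter SLO typically lowers the achievable throughput.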
2.3.1 Run Throughput Optimization
The following command quickly evaluates a PD colocated scenario. For the first run, no explicit search dimensions are specified, so the tool uses the default TP search range. If the run takes too long, reduce --num-devices or specify --tp-sizes in advanced usage to narrow the search space.
python -m cli.inference.throughput_optimizer Qwen/Qwen3-32B \
--device TEST_DEVICE \
--num-devices 8 \
--input-length 3500 \
--output-length 1500 \
--quantize-linear-action W8A8_DYNAMIC \
--quantize-attention-action DISABLED \
--tpot-limits 50
2.3.2 Check Optimization Results
If the command succeeds, the terminal prints candidate configurations and throughput metrics. Focus on:
- TP/DP: Recommended parallel strategy.
- batch size: Batch size that satisfies the SLO constraints.
- TTFT/TPOT: Time to first token and time per output token.
- token throughput: System-level token throughput.
Success criteria:
- The terminal prints candidate or best configurations.
- The output includes throughput, TTFT, and TPOT metrics.
- No model configuration loading failure or parameter conflict is reported.
2.4 Service Simulation: Run End-to-End Serving Simulation
ServingCast service simulation uses YAML files to describe instance groups, request workloads, and serving limits. It can simulate end-to-end serving scenarios with multiple instances and requests, and outputs system-level metrics such as E2E_TIME, TTFT, TPOT, request throughput, and token throughput.
2.4.1 Inspect Example Configurations
The repository includes example configurations that can be used directly:
ls serving_cast/example/instances.yaml serving_cast/example/common.yaml
The two configuration files are used as follows:
| Config File | Purpose |
|---|---|
| instances.yaml | Describes one or more instance groups, such as role, instance count, and TP/DP parallelism. |
| common.yaml | Describes global settings, such as model structure, request workload, serving limits, and simulation parameters. |
The example configurations use TEST_DEVICE and Qwen/Qwen3-32B by default. To change the model, request length, or request count, edit common.yaml. To change instance count, device count, or parallel strategy, edit instances.yaml.
2.4.2 Run Service Simulation
python -m serving_cast.main \
--instance_config_path=./serving_cast/example/instances.yaml \
--common_config_path=./serving_cast/example/common.yaml
2.4.3 Check Service Simulation Results
When simulation finishes, the console prints a performance summary similar to:
E2E_TIME(s) TTFT(s) TPOT(s) INPUT_TOKENS OUTPUT_TOKENS OUTPUT_TOKEN_THROUGHPUT(tok/s)
AVERAGE 1052.591 0.378 0.301 1500.0 3500.0 3.327
MIN 1050.000 0.300 0.300 1500.0 3500.0 2.978
MAX 1175.500 0.600 0.336 1500.0 3500.0 3.334
======== Overall Summary ========
request_throughput(req/s) 0.082
input_token_throughput(tok/s) 122.399
output_token_throughput(tok/s) 285.598
Metric descriptions:
- E2E_TIME: End-to-end latency for a single request.
- TTFT: Time to first token.
- TPOT: Time per output token after the first token.
- request_throughput: System-level request throughput.
- input_token_throughput / output_token_throughput: Aggregated token throughput.
Success criteria:
- The terminal prints a request-level statistics table.
- The output includes Overall Summary.
- The output includes request_throughput, input_token_throughput, or output_token_throughput.
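When the simulation runs in a script or CI job, the Overall Summary can be extracted from captured console text. The following sketch parses the sample summary shown above and performs one sanity check: because every request in the sample has 1500 input tokens and 3500 output tokens, the ratio of output to input token throughput should be close to 3500/1500. It assumes each summary line is a label followed by a single number, as in the sample; the actual spacing may differ. The function name parse_overall_summary is illustrative.

```python
# Summary block copied from the sample output in this guide.
SAMPLE_SUMMARY = """\
======== Overall Summary ========
request_throughput(req/s) 0.082
input_token_throughput(tok/s) 122.399
output_token_throughput(tok/s) 285.598
"""


def parse_overall_summary(text):
    """Return {metric_name: value} for lines that follow the summary header."""
    metrics = {}
    in_summary = False
    for line in text.splitlines():
        if "Overall Summary" in line:
            in_summary = True
            continue
        if not in_summary:
            continue
        parts = line.split()
        if len(parts) != 2:
            continue
        label, value = parts
        try:
            # Drop the unit suffix, e.g. "(tok/s)", from the metric name.
            metrics[label.split("(", 1)[0]] = float(value)
        except ValueError:
            continue
    return metrics


metrics = parse_overall_summary(SAMPLE_SUMMARY)
ratio = metrics["output_token_throughput"] / metrics["input_token_throughput"]
print(metrics)
print(f"output/input throughput ratio: {ratio:.4f}")
```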
2.4.4 Enable Profiling (Optional)
To obtain more fine-grained system performance information, add profiling parameters:
python -m serving_cast.main \
--instance_config_path=./serving_cast/example/instances.yaml \
--common_config_path=./serving_cast/example/common.yaml \
--enable_profiling \
--profiling_output_path=./profiling_results
Raw profiling results are saved under profiling_output_path/{$time_stamp}. Parsed results are saved under profiling_output_path/{$time_stamp}_parsed_result. The parsed output contains chrome_tracing.json and profiler.db, which can be opened with chrome://tracing or MindStudio Insight.
3. Validate Results and Next Steps
If the commands above succeed, you have completed the core TensorCast and ServingCast workflow:
- TensorCast: single-model text-generation performance simulation with operator latency, TPS/Device, and memory estimates.
- Throughput Optimizer: throughput optimization under SLO constraints with recommended parallel strategy and throughput metrics.
- ServingCast: end-to-end service simulation with TTFT, TPOT, request throughput, and token throughput.
Common issues:
- If model configuration cannot be downloaded, check network access to Hugging Face or set the HF_ENDPOINT mirror.
- If cli, tensor_cast, or serving_cast cannot be imported, confirm that you are in the repository root or that PYTHONPATH is set correctly.
- If throughput optimization takes too long, reduce the search range, such as lowering --num-devices or explicitly specifying --tp-sizes.
Continue with: