RFC: Support Prefix Cache Hit Rate in CLI text_generate / throughput_optimizer
Metadata
| Item | Content |
|---|---|
| Status | Approved |
| Author | yaohan404, Codex |
| Created Date | 2026-3-23 |
| Related Links | None |
1. Summary
This RFC adds --prefix-cache-hit-rate to the following CLIs so they can quickly estimate performance with prefix cache enabled:
- `cli.inference.text_generate`
- `cli.inference.throughput_optimizer`
The first version is a token-level approximation only. It does not model block-level hits, prefix-cache management overhead, or standalone resident prefix-cache memory.
2. Goals and Non-goals
Goals:
- allow users to configure prefix cache hit rate from CLI
- let `text_generate` estimate latency under prefix cache
- let `throughput_optimizer` include prefix cache impact in prefill modeling
Non-goals:
- do not modify the underlying performance model
- no block-level hit modeling
- no hash / lookup / eviction / replacement modeling
- no serving-system-level prefix-cache simulation
3. Core Semantics
- prefix cache only affects `prefill`
- hit rate is approximated at `token` granularity
- requests in the same batch are assumed to share prompt length and hit rate
3.1 text_generate
Original input:
- `context_length = C`
- `query_length = Q`
If H = floor(Q * hit_rate) query tokens are hit, rewrite the request as:
- `effective_context_length = C + H`
- `effective_query_length = Q - H`
At the modeling layer:
- `RequestInfo.query_len = effective_query_length`
- `RequestInfo.seq_len = effective_context_length + effective_query_length`
So total seq_len stays unchanged, while the number of tokens that still need prefill is reduced.
Example:
- original: `context_length = 1000`, `query_length = 200`
- `hit_rate = 0.5`
- result: `effective_context_length = 1100`, `effective_query_length = 100`, `seq_len = 1200`
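The rewrite in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: the `RequestInfo` class here is a hypothetical stand-in carrying only the two fields named above, and `rewrite_request` is an assumed helper name.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestInfo:
    # Hypothetical stand-in: only the two fields named in this RFC are modeled.
    query_len: int
    seq_len: int


def rewrite_request(context_length: int, query_length: int, hit_rate: float) -> RequestInfo:
    """Move the cached share of the query into the context (token-level approximation)."""
    hit_tokens = math.floor(query_length * hit_rate)
    effective_context_length = context_length + hit_tokens
    effective_query_length = query_length - hit_tokens
    return RequestInfo(
        query_len=effective_query_length,
        seq_len=effective_context_length + effective_query_length,
    )


# Worked example from this section: C = 1000, Q = 200, hit_rate = 0.5.
info = rewrite_request(context_length=1000, query_length=200, hit_rate=0.5)
print(info.query_len, info.seq_len)
```

With these inputs, `query_len` drops to 100 while `seq_len` stays at 1200, matching the example above.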
3.2 throughput_optimizer
Internally introduce:
- `cached_prefix_tokens = floor(input_length * prefix_cache_hit_rate)`
- `effective_input_length = input_length - cached_prefix_tokens`
Policy:
- all prefill-related paths use `effective_input_length`
- all decode-related paths keep the original logic
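As a rough sketch of the two internal quantities, the helper below computes both from the user-facing inputs. The function name `split_input_length` is an assumption made for illustration; the RFC only fixes the two formulas.

```python
import math


def split_input_length(input_length: int, prefix_cache_hit_rate: float) -> tuple[int, int]:
    """Return (cached_prefix_tokens, effective_input_length) for one request."""
    cached_prefix_tokens = math.floor(input_length * prefix_cache_hit_rate)
    effective_input_length = input_length - cached_prefix_tokens
    return cached_prefix_tokens, effective_input_length


cached, effective = split_input_length(input_length=200, prefix_cache_hit_rate=0.5)
print(cached, effective)
```

Only prefill-related code would consume `effective_input_length`; decode-related code is left untouched.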
4. Design
4.1 CLI and config
Add this argument to both entrypoints:
--prefix-cache-hit-rate
Constraints:
- type: `float`
- default: `0.0`
- range: `[0, 1)`
- examples use `0.5`, not `50%`
Add this field to UserInputConfig:
prefix_cache_hit_rate: float = 0.0
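One way to enforce these constraints is a custom `argparse` type function. The sketch below assumes the entrypoints use `argparse`, which this RFC does not state. `hit_rate_type` and `build_parser` are hypothetical names, and `UserInputConfig` is reduced to the single new field.

```python
import argparse
from dataclasses import dataclass


def hit_rate_type(value: str) -> float:
    """Parse a hit rate and enforce the half-open range [0, 1)."""
    rate = float(value)  # a ValueError (e.g. for "50%") is reported by argparse
    if not 0.0 <= rate < 1.0:
        raise argparse.ArgumentTypeError(
            f"prefix cache hit rate must be in [0, 1), got {value!r}"
        )
    return rate


@dataclass
class UserInputConfig:
    # Only the field added by this RFC is shown; other fields are omitted.
    prefix_cache_hit_rate: float = 0.0


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--prefix-cache-hit-rate",
        type=hit_rate_type,
        default=0.0,
        help="fraction of prompt tokens assumed to hit the prefix cache, in [0, 1)",
    )
    return parser


args = build_parser().parse_args(["--prefix-cache-hit-rate", "0.5"])
config = UserInputConfig(prefix_cache_hit_rate=args.prefix_cache_hit_rate)
print(config)
```

With this approach, an out-of-range or non-numeric value makes `argparse` print a usage error and exit with a non-zero status.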
4.2 text_generate rewrite point
Compute effective lengths in UserInputConfig.get_request_info() so downstream code does not need to rewrite lengths again.
4.3 throughput_optimizer integration
The effective-length semantics should be introduced in the shared forward-shape construction path rather than only inside one optimizer class.
For aggregation mode:
- prefill wave capacity uses `effective_input_length`
- prefill latency uses the prefix-cache-adjusted input length
- decode latency keeps the original logic
- `TTFT` decreases with prefill latency
Here, prefill_batch_size = max_prefill_tokens // effective_input_length means the number of requests that fit in one prefill wave under the prefill token budget. It does not change the user-visible batch_size.
For disaggregation mode:
- `disaggregation-prefill` uses `effective_input_length`
- `disaggregation-decode` ignores prefix cache
4.4 max_prefill_tokens
After prefix cache is introduced, the following logic must use effective_input_length:
- validation against `max_prefill_tokens`
- `prefill_batch_size = max_prefill_tokens // effective_input_length` in aggregation mode
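The sketch below shows how both uses of `effective_input_length` might fit together. It assumes that validation means a single request's effective input must fit within `max_prefill_tokens`. The RFC does not spell out that rule, and the function name and the budget of 8192 tokens are illustrative only.

```python
import math


def prefill_wave_capacity(max_prefill_tokens: int, input_length: int, hit_rate: float) -> int:
    """Number of requests that fit in one prefill wave under the token budget."""
    effective_input_length = input_length - math.floor(input_length * hit_rate)
    # Assumed validation rule: one request must fit in the prefill token budget.
    if effective_input_length > max_prefill_tokens:
        raise ValueError(
            f"effective_input_length {effective_input_length} exceeds "
            f"max_prefill_tokens {max_prefill_tokens}"
        )
    if effective_input_length < 1:
        raise ValueError("effective_input_length must be at least 1")
    return max_prefill_tokens // effective_input_length


# Illustrative budget of 8192 tokens with 200-token prompts.
without_cache = prefill_wave_capacity(8192, 200, 0.0)
with_cache = prefill_wave_capacity(8192, 200, 0.5)
print(without_cache, with_cache)
```

In this example the wave capacity roughly doubles, from 40 to 81 requests, because each request contributes only 100 tokens to the prefill budget.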
5. Metrics and Boundaries
5.1 Metric semantics
- `prefill latency`: affected by prefix cache
- `decode latency`: not directly affected by prefix cache
- `TTFT`: reduced when prefill latency is reduced
- `TPOT`: may change only if its displayed definition includes `TTFT`; that does not mean decode is optimized
5.2 Boundary conditions
In text_generate, if both are specified:
- `--decode`
- `--prefix-cache-hit-rate > 0`
the tool still runs, emits a warning, and ignores prefix cache hit rate.
The current implementation also requires:
- `effective_query_len >= 1`
- `effective_input_length >= 1`
Otherwise, the scenario is unsupported in this version.
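These boundary rules could be expressed as two small checks. The helper names are hypothetical, and using `warnings.warn` is an assumption: the actual tool may report the warning through its own logger instead.

```python
import math
import warnings


def resolve_hit_rate(decode: bool, hit_rate: float) -> float:
    """Ignore the hit rate in decode mode, warning instead of failing."""
    if decode and hit_rate > 0:
        warnings.warn(
            "--prefix-cache-hit-rate is ignored when --decode is set",
            stacklevel=2,
        )
        return 0.0
    return hit_rate


def check_effective_length(length: int, hit_rate: float) -> int:
    """Return the effective length, rejecting scenarios that leave no tokens to prefill."""
    effective = length - math.floor(length * hit_rate)
    if effective < 1:
        raise ValueError(f"unsupported scenario: effective length is {effective}")
    return effective


print(check_effective_length(200, resolve_hit_rate(decode=False, hit_rate=0.5)))
```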
5.3 Memory semantics
This scheme only approximates reduced compute for the current request:
- `text_generate` keeps total `seq_len` unchanged
- no standalone resident prefix-cache memory is modeled
So reported memory numbers should not be interpreted as total cache residency of a real serving system.
5.4 Out of scope for v1
- block-level hits and partial block reuse
- non-uniform hit distribution
- cache management overhead
- extra decode-stage optimizations
- high-fidelity serving-system simulation
6. Testing and Acceptance
Argument tests:
- default is `0.0`
- `0.5` is valid
- `-0.1` and `1.0` are invalid
- inputs leading to effective length `0` are invalid
- `text_generate --decode` with `--prefix-cache-hit-rate > 0` should emit a warning
text_generate tests:
- `context_length = 1000`, `query_length = 200`, `hit_rate = 0.5`
- verify effective lengths are `1100` and `100`
- verify `seq_len = 1200`
- verify `hit_rate = 0` matches original behavior
throughput_optimizer tests:
- `input_length = 200`, `hit_rate = 0.5`
- verify `effective_input_length = 100`
- verify aggregation prefill uses effective length
- verify disaggregation prefill uses effective length
- verify disaggregation decode keeps the original logic
- verify `max_prefill_tokens` validation and prefill-wave capacity both use effective length
- verify invalid inputs return non-zero exit code
7. Future Work
If higher fidelity is needed later:
- add block-level prefix-cache modeling
- add hit-distribution modeling
- add prefix-cache management and memory modeling
- extend `serving_cast` integration for prefix-cache scenarios