API Reference
Feature Description of TensorFlow ANNC for Graph Compilation Optimization
TensorFlow ANNC provides three optimization functions: TensorFlow graph fusion, XLA graph fusion, and operator optimization. This section describes how to enable each function.
TensorFlow Graph Fusion
Table 1 TensorFlow graph fusion interface shows how to use the TensorFlow graph fusion interface.
Table 1 TensorFlow graph fusion interface
Command Line Interface: annc-opt
Function: Enables the graph fusion feature.
Parameter Description:
- -I /path/to/save_model.pb: model before graph fusion
- -O /path/to/new_save_model.pb: model after graph fusion
- pass: graph fusion policy (currently, only lookup_embedding_hash is supported)
Example:
annc-opt -I /base_model/wide_and_deep/1/ -O /optimized_model/wide_and_deep/1/ lookup_embedding_hash
cp -r /base_model/wide_and_deep/1/variables /optimized_model/wide_and_deep/1/
The second command copies the variables directory of the original model into the optimized model directory.
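The two example commands can be wrapped in a small shell function so that the variables directory is always copied after fusion. This is a minimal sketch: the function name fuse_model and the ANNC_OPT override variable are hypothetical (they are not part of annc-opt), and only the -I, -O, and pass arguments documented above are used.

```shell
# Hypothetical wrapper around the two documented steps:
# 1) fuse the graph with annc-opt, 2) copy variables/ next to the new graph.
# ANNC_OPT (hypothetical) lets you point at another annc-opt binary.
fuse_model() {
  local base="$1" out="$2" pass="${3:-lookup_embedding_hash}"
  if [ ! -d "${base}/variables" ]; then
    echo "no variables directory under ${base}" >&2
    return 1
  fi
  "${ANNC_OPT:-annc-opt}" -I "${base}/" -O "${out}/" "${pass}" || return 1
  # Make sure the output directory exists so that cp copies variables/ into it.
  mkdir -p "${out}"
  cp -r "${base}/variables" "${out}/"
}
```

For example, fuse_model /base_model/wide_and_deep/1 /optimized_model/wide_and_deep/1 performs the same two steps as the example above.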
XLA Graph Fusion
Table 2 XLA graph fusion interface describes the XLA graph fusion interface.
Table 2 XLA graph fusion interface
Environment Variable: ANNC_FLAGS
Function: Compiles ANNC and enables XLA graph fusion optimization.
Example: export ANNC_FLAGS="--graph-opt"
Value: The feature is enabled when ANNC_FLAGS is set to --graph-opt.
Operator Optimization
The operator optimization interfaces are described in Table 3 Interface for redundant operator optimization, Table 4 Interface for matrix operator optimization, and Table 5 Interface for Softmax operator optimization.
Table 3 Interface for redundant operator optimization
Environment Variable: ENABLE_BISHENG_GRAPH_OPT
Function: Enables redundant operator optimization.
Example: export ENABLE_BISHENG_GRAPH_OPT=""
Value: The feature is enabled when the environment variable is defined (not null). As the example shows, an empty value is sufficient.
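Because the Value row keys on whether the variable exists rather than on its contents, the following sketch shows how to check and clear it from a POSIX shell. It assumes the switch behaves exactly as the Value row states; the helper names bisheng_graph_opt_enabled and disable_bisheng_graph_opt are hypothetical.

```shell
# Redundant operator optimization is switched on by the presence of
# ENABLE_BISHENG_GRAPH_OPT, so an empty value is enough.
export ENABLE_BISHENG_GRAPH_OPT=""

# Succeeds when the variable is defined, even if it is empty:
# ${VAR+set} expands to "set" for any defined VAR and to "" otherwise.
bisheng_graph_opt_enabled() {
  [ "${ENABLE_BISHENG_GRAPH_OPT+set}" = set ]
}

# Assigning 0 or "" would leave the variable defined; removing it
# is what turns the switch off.
disable_bisheng_graph_opt() {
  unset ENABLE_BISHENG_GRAPH_OPT
}
```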
Table 4 Interface for matrix operator optimization
Environment Variable: ANNC_FLAGS
Function: Enables matrix operator optimization.
Example: export ANNC_FLAGS="--gemm-opt"
Value: The feature is enabled when ANNC_FLAGS is set to --gemm-opt.
Table 5 Interface for Softmax operator optimization
Environment Variable: XLA_FLAGS
Function: Enables Softmax operator optimization.
Example: export XLA_FLAGS="--xla_cpu_enable_xnnpack=true"
Value: The feature is enabled when XLA_FLAGS is set to --xla_cpu_enable_xnnpack=true.
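XLA_FLAGS often carries several whitespace-separated XLA options, so exporting it as in the example replaces anything already set. A minimal sketch that appends the Softmax flag instead; the function name append_xla_flag is hypothetical, and --xla_dump_to in the test below only stands in for any pre-existing flag.

```shell
# Add one flag to XLA_FLAGS without discarding flags that are already set.
# XLA reads XLA_FLAGS as a whitespace-separated list of --name=value flags.
append_xla_flag() {
  case " ${XLA_FLAGS:-} " in
    *" $1 "*) return 0 ;;   # already present: leave XLA_FLAGS unchanged
  esac
  # ${XLA_FLAGS:+$XLA_FLAGS } adds the old value plus a space only if non-empty.
  export XLA_FLAGS="${XLA_FLAGS:+$XLA_FLAGS }$1"
}

append_xla_flag --xla_cpu_enable_xnnpack=true
echo "XLA_FLAGS=${XLA_FLAGS}"
```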
Feature Description of TensorFlow Serving Thread Scheduling
Kunpeng's TensorFlow Serving Thread Scheduling feature provides two configuration options: batch operator scheduling and thread affinity isolation. You can enable either or both, depending on your requirements.
To start an inference stress test with TensorFlow Serving, see section Starting the Service and Performing a Pressure Test in the TensorFlow Serving Porting Guide.
Batch Operator Scheduling
TF Serving Command Line Interface: --batch_op_scheduling
Function: Enables the operator scheduling optimization and XLA thread pool management optimization features.
Parameter Type: bool
Value Range: true or false. Set it to true to enable the feature or false to disable it.
Recommended Scenario: Recommended when single-core inference latency already meets requirements. This option improves concurrent processing capability and overall throughput.
Recommended Configuration:
- --tensorflow_intra_op_parallelism=1: sets the intra-operator parallelism degree to 1.
- --tensorflow_inter_op_parallelism=80: sets the inter-operator parallelism degree to the number of CPU cores (80 in this example).
- --batch_op_scheduling=true: enables the batch operator scheduling feature.
Example:
/path/to/tensorflow_model_server --port=8850 --rest_api_port=8851 --model_base_path=/path/to/saved_model/ --model_name=model --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=80 --batch_op_scheduling=true
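The recommended configuration ties --tensorflow_inter_op_parallelism to the number of CPU cores. The sketch below derives that number with nproc, which counts only the CPUs the calling process may run on, so a shell started under numactl -C 0-79 yields the 80 used in the example. The function name start_serving and the TF_SERVING_BIN and DRY_RUN variables are hypothetical conveniences; the server flags are the ones documented above.

```shell
# Start TF Serving with the recommended batch operator scheduling settings.
# Inter-op parallelism = CPUs available to this process, as reported by
# nproc (GNU nproc honors CPU affinity, e.g. from numactl -C, and also
# OMP_NUM_THREADS if that variable is set).
start_serving() {
  local model_path="$1" model_name="$2" inter_op
  inter_op="$(nproc)"
  set -- "${TF_SERVING_BIN:-/path/to/tensorflow_model_server}" \
    --port=8850 --rest_api_port=8851 \
    "--model_base_path=${model_path}" "--model_name=${model_name}" \
    --tensorflow_intra_op_parallelism=1 \
    "--tensorflow_inter_op_parallelism=${inter_op}" \
    --batch_op_scheduling=true
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"   # only print the command line
  else
    "$@"        # run the server
  fi
}
```

For example, DRY_RUN=1 start_serving /path/to/saved_model/ model prints the full command line without starting the server.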
Thread Affinity Isolation
TF Serving Command Line Interface: --task_affinity_isolation
Function: Enables the thread affinity isolation feature, which offers two isolation methods:
- Sequential core binding allocates TensorFlow computing threads to the first k cores and TF Serving communication threads to the remaining cores.
- Interleaved core binding (applicable when hyper-threading is enabled) assigns TensorFlow threads to physical cores and TF Serving communication threads to virtual cores.
Parameter Type: std::string
Parameter Format: mode;m-n;k. The default value is 0.
Value Range: For details, see Table 1 Thread affinity isolation parameter values.
Recommended Scenario:
- When TensorFlow scheduling is used, sequential core binding is recommended.
- When batch operator scheduling and thread affinity isolation are used together and hyper-threading is enabled, interleaved core binding is recommended.
Example: A server has four Non-Uniform Memory Access (NUMA) nodes, each containing 40 physical cores (160 in total), or 80 logical cores (320 in total) when hyper-threading is enabled.
Table 1 Thread affinity isolation parameter values
Parameter | Value Range | Description | Constraint
mode | 0, 1, or 2 | 0 (OFF): thread affinity is disabled. 1 (ORDER): cores are bound in sequence. 2 (INTERVAL): cores are bound in an interleaved manner. | When mode is 0, m-n and k are invalid and can be omitted.
m-n | Available CPU cores | The core binding range is [m, n]. | m ≤ n
k | Available CPU cores | Number of cores allocated to TensorFlow threads. | k ≤ n - m + 1 (the total number of bound cores). When mode is 2, k is invalid and can be omitted.
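The mode;m-n;k format and its constraints are easy to get wrong, so a hypothetical validation helper may be useful before starting the server. This sketch enforces the documented rules plus two assumptions of its own: k is treated as required for mode 1 (the table only allows omitting it for modes 0 and 2), and k must be at least 1. It does not check that the cores exist on the machine. For mode 1 it also prints the split as read from the Function description: the first k cores of [m, n] for TensorFlow threads and the rest for TF Serving.

```shell
# Hypothetical helper: validate a --task_affinity_isolation value of the
# form "mode;m-n;k". On success it prints a one-line summary; on failure it
# prints a reason to stderr and returns 1.
validate_affinity() {
  local value="$1" mode rest range m n k num total tf_end
  mode="${value%%;*}"                    # field before the first ';'
  case "$mode" in
    0) echo "OFF"; return 0 ;;           # m-n and k are ignored
    1|2) ;;
    *) echo "mode must be 0, 1, or 2" >&2; return 1 ;;
  esac

  case "$value" in
    *";"*) rest="${value#*;}" ;;         # "m-n;k" or just "m-n"
    *) echo "m-n is required for mode $mode" >&2; return 1 ;;
  esac
  range="${rest%%;*}"
  case "$range" in
    *-*) m="${range%%-*}"; n="${range#*-}" ;;
    *) echo "range must look like m-n" >&2; return 1 ;;
  esac
  for num in "$m" "$n"; do
    case "$num" in
      ''|*[!0-9]*) echo "not a core index: $num" >&2; return 1 ;;
    esac
  done
  [ "$m" -le "$n" ] || { echo "m must be <= n" >&2; return 1; }
  total=$((n - m + 1))                   # number of bound cores

  if [ "$mode" = 2 ]; then               # INTERVAL: k is ignored
    echo "INTERVAL cores $m-$n"
    return 0
  fi

  # mode 1 (ORDER): k is treated as required (assumption)
  case "$rest" in
    *";"*) k="${rest#*;}" ;;
    *) echo "k is required for mode 1" >&2; return 1 ;;
  esac
  case "$k" in
    ''|*[!0-9]*) echo "k must be a number" >&2; return 1 ;;
  esac
  [ "$k" -ge 1 ] && [ "$k" -le "$total" ] ||
    { echo "k must be between 1 and $total" >&2; return 1; }

  tf_end=$((m + k - 1))
  if [ "$k" -eq "$total" ]; then
    echo "ORDER TensorFlow $m-$tf_end TFServing none"
  else
    echo "ORDER TensorFlow $m-$tf_end TFServing $((tf_end + 1))-$n"
  fi
}
```

Because ; separates commands in the shell, quote the value on the command line, for example --task_affinity_isolation="1;0-79;40". These values are illustrative only, not a tuned recommendation.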
Note:
numactl is a tool for controlling NUMA scheduling and memory placement policy on Linux. It can be installed using Yum:
yum install -y numactl numactl-devel
For example, numactl -C 0-79 -m 0 runs the TF Serving service on the cores of NUMA node 0: -C binds the process to the specified CPU cores (here 0-79), and -m restricts its memory allocation to NUMA node 0. Keeping both CPU and memory on the same node lets the service make full use of that node's CPU resources.
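The 0-79 range above assumes that NUMA node 0 owns logical CPUs 0 to 79, which depends on how the kernel numbers CPUs on a given machine. The following sketch reads the node's CPU list from sysfs (/sys/devices/system/node/nodeN/cpulist), a format that numactl -C accepts as is, including comma-separated lists. The function run_on_numa_node and the NODE_SYSFS and DRY_RUN variables are hypothetical.

```shell
# Run (or, with DRY_RUN=1, print) a command bound to one NUMA node's CPUs
# and memory. The CPU list is read from sysfs rather than hard-coded.
# NODE_SYSFS is a hypothetical override of the sysfs directory (for tests).
run_on_numa_node() {
  local node="$1" cpus
  shift
  # cpulist looks like "0-79" or "0-39,160-199"
  cpus="$(cat "${NODE_SYSFS:-/sys/devices/system/node}/node${node}/cpulist")" ||
    return 1
  set -- numactl -C "${cpus}" -m "${node}" "$@"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"   # only print the command line
  else
    "$@"
  fi
}
```

For example, run_on_numa_node 0 /path/to/tensorflow_model_server --port=8850 followed by the remaining server flags starts TF Serving on node 0.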