API Reference

Feature Description of TensorFlow ANNC for Graph Compilation Optimization

TensorFlow ANNC provides three optimization functions: TensorFlow graph fusion, XLA graph fusion, and operator optimization. This section describes how to enable each function.

TensorFlow Graph Fusion

Table 1 TensorFlow graph fusion interface shows how to use the TensorFlow graph fusion interface.

Table 1 TensorFlow graph fusion interface

Command Line Interface	annc-opt
Function	Enables the graph fusion feature.
Parameter Description	`-I /path/to/save_model.pb`: model before graph fusion
	`-O /path/to/new_save_model.pb`: model after graph fusion
	`pass`: graph fusion policy (currently, `lookup_embedding_hash` is supported)
Example	annc-opt -I /base_model/wide_and_deep/1/ -O /optimized_model/wide_and_deep/1/ lookup_embedding_hash
	cp -r /base_model/wide_and_deep/1/variables /optimized_model/wide_and_deep/1/
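
The example runs `annc-opt` and then copies the `variables` directory next to the fused model. A minimal sketch of a helper that generates both commands for any pair of model directories is shown below. The function name `print_fusion_steps` is hypothetical and not part of ANNC, and the helper only prints the commands so that they can be reviewed before being run.

```shell
# Hypothetical helper (not part of ANNC): prints the two commands from the
# example for a given pair of model directories, without running them.
print_fusion_steps() {
    base_dir=$1      # directory of the model before graph fusion
    out_dir=$2       # directory for the model after graph fusion
    pass_name=$3     # graph fusion policy, for example lookup_embedding_hash

    # Strip a trailing slash so the printed paths are formed consistently.
    base_dir=${base_dir%/}
    out_dir=${out_dir%/}

    echo "annc-opt -I $base_dir/ -O $out_dir/ $pass_name"
    echo "cp -r $base_dir/variables $out_dir/"
}

print_fusion_steps /base_model/wide_and_deep/1/ /optimized_model/wide_and_deep/1/ lookup_embedding_hash
```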

XLA Graph Fusion

Table 2 XLA graph fusion interface describes the XLA graph fusion interface.

Table 2 XLA graph fusion interface

Environment Variable	ANNC_FLAGS
Function	Compiles ANNC and enables XLA graph fusion optimization.
Example	export ANNC_FLAGS="--graph-opt"
Value	The feature is enabled when the environment variable is set to `--graph-opt`.

Operator Optimization

The operator optimization interfaces are described in Table 3 Interface for redundant operator optimization, Table 4 Interface for matrix operator optimization, and Table 5 Interface for Softmax operator optimization.

Table 3 Interface for redundant operator optimization

Environment Variable	ENABLE_BISHENG_GRAPH_OPT
Function	Enables redundant operator optimization.
Example	export ENABLE_BISHENG_GRAPH_OPT=""
Value	The feature is enabled when the environment variable is set. As the example shows, an empty value is sufficient.
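
The example assigns an empty string, which suggests that the check is on whether the variable is defined rather than on its value. That reading is an assumption drawn from the example, not a documented guarantee. Under that assumption, the following sketch shows how a shell script can tell a variable that is set but empty from one that is unset. The function name `bisheng_opt_state` is hypothetical.

```shell
# Report whether ENABLE_BISHENG_GRAPH_OPT is defined, regardless of its value.
# ${VAR+x} expands to "x" when VAR is set (even to "") and to nothing when unset.
bisheng_opt_state() {
    if [ -n "${ENABLE_BISHENG_GRAPH_OPT+x}" ]; then
        echo "set"
    else
        echo "unset"
    fi
}

unset ENABLE_BISHENG_GRAPH_OPT
bisheng_opt_state                    # prints "unset"

export ENABLE_BISHENG_GRAPH_OPT=""
bisheng_opt_state                    # prints "set"
```

If the assumption holds, assigning an empty value does not turn the optimization off; removing the variable with `unset` would be the way to do so.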

Table 4 Interface for matrix operator optimization

Environment Variable	ANNC_FLAGS
Function	Enables matrix operator optimization.
Example	export ANNC_FLAGS="--gemm-opt"
Value	The feature is enabled when the environment variable is set to `--gemm-opt`.

Table 5 Interface for Softmax operator optimization

Environment Variable	XLA_FLAGS
Function	Enables Softmax operator optimization.
Example	export XLA_FLAGS="--xla_cpu_enable_xnnpack=true"
Value	The feature is enabled when the environment variable is set to `--xla_cpu_enable_xnnpack=true`.

Feature Description of TensorFlow Serving Thread Scheduling

The TensorFlow Serving thread scheduling feature for Kunpeng provides two configuration options: batch operator scheduling and thread affinity isolation. Configure them based on your requirements.

To start an inference stress test with TensorFlow Serving, see section Starting the Service and Performing a Pressure Test in the TensorFlow Serving Porting Guide.

Batch Operator Scheduling

TF Serving Command Line Interface	--batch_op_scheduling
Function	Enables the operator scheduling optimization and XLA thread pool management optimization features.
Parameter Type	bool
Value Range	`true` or `false`. Set it to `true` to enable the feature or `false` to disable the feature.
Recommended Scenario	Recommended when single-core inference latency meets requirements. This option enhances concurrent processing capability and overall throughput.
Recommended Configuration	`--tensorflow_intra_op_parallelism=1`: sets the intra-operator parallelism degree to 1.
	`--tensorflow_inter_op_parallelism=80`: sets the inter-operator parallelism degree to the number of CPU cores (80 in this example).
	`--batch_op_scheduling=true`: enables the batch operator scheduling feature.
Example	/path/to/tensorflow_model_server --port=8850 --rest_api_port=8851 --model_base_path=/path/to/saved_model/ --model_name=model --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=80 --batch_op_scheduling=true
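
The recommended configuration ties `--tensorflow_inter_op_parallelism` to the number of CPU cores. A minimal sketch that derives this value at launch time instead of hard-coding it is shown below. The function name `batch_sched_flags` is hypothetical, the sketch assumes that the `nproc` utility from coreutils is available when no core count is passed, and it only prints the flags.

```shell
# Hypothetical helper: prints the recommended batch operator scheduling flags.
# The core count is taken from $1; when $1 is omitted, nproc is used instead.
batch_sched_flags() {
    cores=${1:-$(nproc)}
    echo "--tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=$cores --batch_op_scheduling=true"
}

# For a server with 80 CPU cores:
batch_sched_flags 80
# prints: --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=80 --batch_op_scheduling=true
```

Note that `nproc` reports the processing units available to the current process, so under a CPU binding the default reflects the bound cores rather than all cores in the machine.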

Thread Affinity Isolation

TF Serving Command Line Interface	--task_affinity_isolation
Function	Enables the thread affinity isolation feature, which offers two isolation methods:
	Sequential core binding: allocates TensorFlow computing threads to the first `k` cores and TF Serving communication threads to the remaining cores.
	Interleaved core binding (applicable when hyper-threading is enabled): assigns TensorFlow computing threads to physical cores and TF Serving communication threads to virtual cores.
Parameter Type	std::string
Parameter Format	`mode;m-n;k`. The default value is `0`.
Value Range	For details, see Table 1 Thread affinity isolation parameter values.
Recommended Scenario	When TensorFlow scheduling is used, sequential core binding is recommended. When both batch operator scheduling and thread affinity isolation are used, and hyper-threading is enabled, interleaved core binding is recommended.
Example	A server has four non-uniform memory access (NUMA) nodes. Each node contains 40 physical cores (160 in total), or 80 logical cores (320 in total) when hyper-threading is enabled.
	For TensorFlow scheduling mode, use these reference parameters:
	numactl -C 0-79 -m 0 /path/to/tensorflow_model_server --port=8850 --rest_api_port=8851 --model_base_path=/path/to/saved_model/ --model_name=model --tensorflow_intra_op_parallelism=75 --tensorflow_inter_op_parallelism=75 --task_affinity_isolation="1;0-79;75"
	With `--batch_op_scheduling` enabled, set `--tensorflow_inter_op_parallelism` to the physical core count of the bound NUMA node (40 here) and use these reference parameters:
	numactl -C 0-79 -m 0 /path/to/tensorflow_model_server --port=8850 --rest_api_port=8851 --model_base_path=/path/to/saved_model/ --model_name=model --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=40 --batch_op_scheduling=true --task_affinity_isolation="2;0-79"

Table 1 Thread affinity isolation parameter values

Parameter	Value Range	Description	Constraint
mode	`0`, `1`, or `2`	`0`: (OFF) Thread affinity is disabled. `1`: (ORDER) Cores are bound in sequence. `2`: (INTERVAL) Cores are bound in an interleaved manner.	When `mode` is set to `0`, `m-n` and `k` are invalid and can be omitted.
m-n	Available CPU cores	The core binding range is [m, n].	m ≤ n
k	Available CPU cores	Number of cores allocated to the TensorFlow computing threads.	k ≤ n - m + 1 (the total number of bound cores). When `mode` is set to `2`, `k` is invalid and can be omitted.
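
The constraints in the table can be checked before the value is passed to `--task_affinity_isolation`. The following is a minimal sketch of such a check, not part of TF Serving: the function names `is_uint` and `check_affinity` are hypothetical, `k` is treated as required when `mode` is `1` (an assumption), and the sketch validates only the format and the arithmetic constraints, not whether the cores exist on the machine.

```shell
# Succeeds when $1 is a decimal integer without leading zeros, so that shell
# arithmetic never interprets it as octal.
is_uint() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        0|[1-9]*)    return 0 ;;
        *)           return 1 ;;
    esac
}

# Hypothetical validator for a mode;m-n;k value. Prints "ok" and returns 0
# when the value satisfies the table constraints; otherwise prints a reason
# and returns 1.
check_affinity() {
    value=$1

    # Split "mode;m-n;k" on ';'. Missing trailing fields become empty strings.
    mode=${value%%;*}
    rest=${value#"$mode"}; rest=${rest#;}
    range=${rest%%;*}
    k=${rest#"$range"}; k=${k#;}

    case $mode in
        0)   echo "ok"; return 0 ;;      # OFF: m-n and k are ignored
        1|2) ;;
        *)   echo "mode must be 0, 1, or 2"; return 1 ;;
    esac

    # The range must have the form m-n with m <= n.
    case $range in
        *-*) m=${range%%-*}; n=${range#*-} ;;
        *)   echo "range must have the form m-n"; return 1 ;;
    esac
    if ! is_uint "$m" || ! is_uint "$n"; then
        echo "m and n must be non-negative integers"; return 1
    fi
    if [ "$m" -gt "$n" ]; then
        echo "m must not exceed n"; return 1
    fi

    # INTERVAL: k is ignored.
    if [ "$mode" = 2 ]; then
        echo "ok"; return 0
    fi

    # ORDER: k is treated as required here (an assumption), with k <= n - m + 1.
    if ! is_uint "$k"; then
        echo "k must be a non-negative integer"; return 1
    fi
    if [ "$k" -gt $((n - m + 1)) ]; then
        echo "k exceeds the number of bound cores"; return 1
    fi
    echo "ok"
}

check_affinity "1;0-79;75"           # prints "ok"
check_affinity "2;0-79"              # prints "ok"
check_affinity "1;0-79;81" || true   # prints "k exceeds the number of bound cores"
```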

Note: numactl is a tool for controlling NUMA scheduling and memory placement on Linux. It can be installed using Yum:
yum install -y numactl numactl-devel
For example, `numactl -C 0-79 -m 0` runs the TF Serving service on cores 0 to 79 (the cores of NUMA node 0 in this example) and allocates its memory from NUMA node 0, so that the CPU resources of that node can be fully utilized. `-C` specifies the cores and `-m` specifies the NUMA node used for memory.