API Reference

Feature Description of TensorFlow ANNC for Graph Compilation Optimization

TensorFlow ANNC provides three optimization functions: TensorFlow graph fusion, XLA graph fusion, and operator optimization. This section describes how to enable each function.

TensorFlow Graph Fusion

Table 1 describes the TensorFlow graph fusion interface.

Table 1 TensorFlow graph fusion interface

Command Line Interface

annc-opt

Function

Enables the graph fusion feature.

Parameter Description

  • -I /path/to/save_model.pb: model before graph fusion
  • -O /path/to/new_save_model.pb: model after graph fusion
  • pass: graph fusion policy. Currently, lookup_embedding_hash is supported.

Example

annc-opt -I /base_model/wide_and_deep/1/ -O /optimized_model/wide_and_deep/1/ lookup_embedding_hash
cp -r /base_model/wide_and_deep/1/variables /optimized_model/wide_and_deep/1/
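
The two commands in the example can be wrapped in one helper so that the variables directory is always copied along with the fused graph. This is a minimal sketch, assuming annc-opt is on PATH and accepts the arguments exactly as in the example above. The names fuse_model, run_or_print, and DRY_RUN are hypothetical and are not part of ANNC.

```shell
#!/bin/sh
# Hypothetical helper, not part of ANNC: print the command when DRY_RUN=1,
# otherwise execute it.
run_or_print() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the documented example:
#   $1: model directory before graph fusion
#   $2: model directory after graph fusion
#   $3: graph fusion pass (defaults to lookup_embedding_hash)
fuse_model() {
    fm_src=${1%/}
    fm_dst=${2%/}
    fm_pass=${3:-lookup_embedding_hash}

    if [ -z "$fm_src" ] || [ -z "$fm_dst" ]; then
        echo "usage: fuse_model SRC_DIR DST_DIR [PASS]" >&2
        return 2
    fi

    # Run graph fusion first; copy the variables only if it succeeded.
    run_or_print annc-opt -I "$fm_src/" -O "$fm_dst/" "$fm_pass" || return 1
    run_or_print cp -r "$fm_src/variables" "$fm_dst/"
}
```

Calling fuse_model /base_model/wide_and_deep/1/ /optimized_model/wide_and_deep/1/ with DRY_RUN=1 prints the same two commands as the example without running them, which is a convenient way to check paths before touching a model.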

XLA Graph Fusion

Table 2 describes the XLA graph fusion interface.

Table 2 XLA graph fusion interface

Environment Variable

ANNC_FLAGS

Function

Compiles ANNC and enables XLA graph fusion optimization.

Example

export ANNC_FLAGS="--graph-opt"

Value

Enables the feature when the environment variable is set to --graph-opt.

Operator Optimization

Table 3, Table 4, and Table 5 describe the operator optimization interfaces for redundant operators, matrix operators, and Softmax operators, respectively.

Table 3 Interface for redundant operator optimization

Environment Variable

ENABLE_BISHENG_GRAPH_OPT

Function

Enables redundant operator optimization.

Example

export ENABLE_BISHENG_GRAPH_OPT=""

Value

Enables the feature when the environment variable is set (not null). In the example, it is set to an empty string.

Table 4 Interface for matrix operator optimization

Environment Variable

ANNC_FLAGS

Function

Enables matrix operator optimization.

Example

export ANNC_FLAGS="--gemm-opt"

Value

Enables the feature when the environment variable is set to --gemm-opt.

Table 5 Interface for Softmax operator optimization

Environment Variable

XLA_FLAGS

Function

Enables Softmax operator optimization.

Example

export XLA_FLAGS="--xla_cpu_enable_xnnpack=true"

Value

Enables the feature when the environment variable is set to --xla_cpu_enable_xnnpack=true.
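
Because these switches live in three different environment variables, it can help to check what the current shell will pass to the server before starting it. The sketch below only inspects the environment and does not call ANNC. The function names are hypothetical. It assumes that ENABLE_BISHENG_GRAPH_OPT counts as enabled whenever it is defined, even as an empty string as in the example, and it compares ANNC_FLAGS and XLA_FLAGS against the exact documented values, because this section does not describe how several flags are combined in one variable.

```shell
#!/bin/sh
# Hypothetical helper: print "<name>: on" when the actual value ($2)
# equals the documented value ($3), otherwise "<name>: off".
flag_status() {
    if [ "$2" = "$3" ]; then
        echo "$1: on"
    else
        echo "$1: off"
    fi
}

# Hypothetical helper: report the state of each documented switch.
annc_env_status() {
    # Assumption: "not null" means the variable is defined, even if empty.
    if [ "${ENABLE_BISHENG_GRAPH_OPT+set}" = set ]; then
        echo "redundant-operator: on"
    else
        echo "redundant-operator: off"
    fi
    flag_status "xla-graph-fusion" "${ANNC_FLAGS:-}" "--graph-opt"
    flag_status "matrix-operator" "${ANNC_FLAGS:-}" "--gemm-opt"
    flag_status "softmax-operator" "${XLA_FLAGS:-}" "--xla_cpu_enable_xnnpack=true"
}
```

Run annc_env_status after the export commands to confirm that the intended variables are set in the shell that launches the service.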

Feature Description of TensorFlow Serving Thread Scheduling

Kunpeng's TensorFlow Serving thread scheduling feature provides two configuration options: batch operator scheduling and thread affinity isolation. Configure the options based on your requirements.

To start an inference stress test with TensorFlow Serving, see section Starting the Service and Performing a Pressure Test in the TensorFlow Serving Porting Guide.

Batch Operator Scheduling

TF Serving Command Line Interface

--batch_op_scheduling

Function

Enables the operator scheduling optimization and XLA thread pool management optimization features.

Parameter Type

bool

Value Range

true or false. Set it to true to enable the feature or false to disable the feature.

Recommended Scenario

Recommended when single-core inference latency already meets your requirements. In that case, this option improves concurrent processing capability and overall throughput.

Recommended Configuration

  • --tensorflow_intra_op_parallelism=1: Sets the intra-operator parallelism degree to 1.
  • --tensorflow_inter_op_parallelism=80: Sets the inter-operator parallelism degree to the number of CPU cores (80 in this example).
  • --batch_op_scheduling=true: Enables the batch operator scheduling feature.

Example

/path/to/tensorflow_model_server --port=8850 --rest_api_port=8851 --model_base_path=/path/to/saved_model/ --model_name=model --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=80 --batch_op_scheduling=true
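
The recommended configuration ties --tensorflow_inter_op_parallelism to the core count, so the value changes from machine to machine. The sketch below prints (and does not run) the command line from the example with the core count filled in. The function name batch_sched_cmd is hypothetical, and the server path, ports, and model name are the placeholder values from the example. When no core count is given, it falls back to the number of online logical CPUs reported by getconf, which is an assumption: on a machine with hyper-threading enabled, pass the physical core count explicitly.

```shell
#!/bin/sh
# Hypothetical helper: print a tensorflow_model_server command line for
# batch operator scheduling.
#   $1: model base path
#   $2: core count for inter-op parallelism; defaults to the number of
#       online logical CPUs reported by getconf
batch_sched_cmd() {
    bs_model=$1
    bs_cores=${2:-$(getconf _NPROCESSORS_ONLN)}
    # echo joins its arguments with single spaces, giving one command line.
    echo "/path/to/tensorflow_model_server" \
        "--port=8850 --rest_api_port=8851" \
        "--model_base_path=$bs_model --model_name=model" \
        "--tensorflow_intra_op_parallelism=1" \
        "--tensorflow_inter_op_parallelism=$bs_cores" \
        "--batch_op_scheduling=true"
}

batch_sched_cmd /path/to/saved_model/ 80
```

With the arguments shown, the printed line matches the example above.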

Thread Affinity Isolation

TF Serving Command Line Interface

--task_affinity_isolation

Function

Enables the thread affinity isolation feature, which offers two isolation methods:

  • Sequential core binding allocates TensorFlow computing threads to the first K cores and TF Serving communication threads to the remaining cores.
  • Interleaved core binding (applicable when hyper-threading is enabled) assigns TensorFlow threads to physical cores and TF Serving communication threads to virtual cores.

Parameter Type

std::string

Parameter Format

mode;m-n;k. The default value is 0 (thread affinity disabled).

Value Range

For details, see Table 1 Thread affinity isolation parameter values.

Recommended Scenario

  • When TensorFlow scheduling is used, sequential core binding is recommended.
  • When both batch operator scheduling and thread affinity isolation are used, and hyper-threading is enabled, interleaved core binding is recommended.

Example

A server has four Non-Uniform Memory Access (NUMA) nodes. Each node contains 40 physical cores (160 in total), or 80 logical cores (320 in total) when hyper-threading is enabled.

  • For TensorFlow scheduling mode, use these reference parameters:
    numactl -C 0-79 -m 0 /path/to/tensorflow_model_server --port=8850 --rest_api_port=8851 --model_base_path=/path/to/saved_model/ --model_name=model --tensorflow_intra_op_parallelism=75 --tensorflow_inter_op_parallelism=75 --task_affinity_isolation="1;0-79;75"
  • With --batch_op_scheduling enabled, set --tensorflow_inter_op_parallelism to the physical core count and use these reference parameters:
    numactl -C 0-79 -m 0 /path/to/tensorflow_model_server --port=8850 --rest_api_port=8851 --model_base_path=/path/to/saved_model/ --model_name=model --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=40 --batch_op_scheduling=true --task_affinity_isolation="2;0-79"

Table 1 Thread affinity isolation parameter values

Parameter: mode
Value Range: 0, 1, or 2
Description:
  • 0: (OFF) Thread affinity is disabled.
  • 1: (ORDER) Cores are bound in sequence.
  • 2: (INTERVAL) Cores are bound in an interleaved manner.
Constraint: When mode is set to 0, m-n and k are invalid and can be omitted.

Parameter: m-n
Value Range: Available CPU cores
Description: The core binding range is [m, n].
Constraint: m ≤ n

Parameter: k
Value Range: Available CPU cores
Description: Number of cores allocated to the TensorFlow thread.
Constraint: k ≤ n - m + 1 (the total number of bound cores). When mode is set to 2, k is invalid and can be omitted.
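
The constraints in the table above can be checked before the value is passed to --task_affinity_isolation. The following is a minimal sketch of such a check, not the parser that TF Serving itself uses. The function names is_uint and check_affinity are hypothetical, and the sketch assumes that k is required when mode is 1, which the table implies but does not state.

```shell
#!/bin/sh
# Return 0 if $1 is a non-negative decimal integer.
is_uint() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical validator for a value of the form mode;m-n;k.
check_affinity() {
    ca_val=$1
    ca_mode=${ca_val%%;*}
    case $ca_mode in
        0) return 0 ;;   # m-n and k are not used when mode is 0
        1|2) ;;
        *) echo "mode must be 0, 1, or 2" >&2; return 1 ;;
    esac

    # Split off the m-n field and the optional k field.
    case $ca_val in
        *";"*) ca_rest=${ca_val#*;} ;;
        *) ca_rest= ;;
    esac
    ca_range=${ca_rest%%;*}
    case $ca_rest in
        *";"*) ca_k=${ca_rest#*;} ;;
        *) ca_k= ;;
    esac

    case $ca_range in
        *-*) ca_m=${ca_range%%-*}; ca_n=${ca_range#*-} ;;
        *) echo "core range must have the form m-n" >&2; return 1 ;;
    esac
    if ! is_uint "$ca_m" || ! is_uint "$ca_n" || [ "$ca_m" -gt "$ca_n" ]; then
        echo "core range requires integers with m <= n" >&2
        return 1
    fi

    # k is not used in mode 2; assumed to be required in mode 1.
    [ "$ca_mode" = 2 ] && return 0
    if ! is_uint "$ca_k" || [ "$ca_k" -gt $((ca_n - ca_m + 1)) ]; then
        echo "k must be an integer with k <= n - m + 1" >&2
        return 1
    fi
    return 0
}
```

For example, check_affinity "1;0-79;75" and check_affinity "2;0-79" both succeed, while check_affinity "1;0-79;81" fails because 81 exceeds the 80 cores in the range.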

Note: numactl is a tool used to control and manage the NUMA architecture on Linux. It can be installed using Yum.

yum install -y numactl numactl-devel

For example, numactl -C 0-79 -m 0 runs the TF Serving service on the cores of NUMA node 0 so that CPU resources can be fully utilized. -C 0-79 specifies the CPU cores to run on, and -m 0 specifies that memory is allocated from NUMA node 0.