RecSDK Docker Image Build Overview
This document is compiled from the Dockerfiles provided in the docker directory, aiming to help developers quickly understand and use these container images.
Quick Reference
- Where to get help
RecSDK Overview
RecSDK is an application-enablement SDK for search, recommendation, and advertising (SRA) scenarios in the Internet market. To meet the needs of SRA model training, it provides an Atlas-based framework that supports large-scale SRA scenarios and efficient training of SRA models.
RecSDK features include:
- Basic model training capabilities. Supports single-node single-device training and multi-node multi-device distributed training.
- Recommendation-specific features. Based on RecSDK's sparse table solution, RecSDK provides essential functionality such as feature saving/loading, feature admission, and feature eviction.
- Large-scale sparse table features. Supports multi-level storage across accelerator memory, host memory, and host disk; supports multi-node storage; supports dynamic capacity expansion. Scale can exceed 10 TB.
Key Fields in Image Tags
Image tags follow the convention below for intuitive identification of the software and hardware stack versions contained within the image:
- RecSDK version: RecSDK version number (e.g., 26.1.0)
- OS version: Base operating system codename or version number (e.g., ubuntu20.04)
- Framework identifier: The ML framework supported by the image (e.g., tf for TensorFlow, or pt for PyTorch)
- Python version: The core interpreter version (e.g., py3.7)
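As an illustration of the convention, the four fields can be split out of a tag with plain shell. This is a minimal sketch: the function name parse_image_tag and the variable names are hypothetical and not part of RecSDK, and it assumes no field contains a hyphen of its own.

```shell
# Hypothetical helper: split a tag such as 26.1.0-ubuntu20.04-tf-py3.7
# into its four hyphen-separated fields.
parse_image_tag() {
    # The here-document feeds the tag to read in the current shell,
    # so the four variables remain set after the function returns.
    IFS=- read -r recsdk_ver os_ver framework py_ver <<EOF
$1
EOF
}

parse_image_tag "26.1.0-ubuntu20.04-tf-py3.7"
echo "RecSDK=$recsdk_ver OS=$os_ver framework=$framework python=$py_ver"
```

For the tag above, this prints RecSDK=26.1.0 OS=ubuntu20.04 framework=tf python=py3.7.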
Dockerfile Archive Paths
The project currently supports container scenarios divided by underlying frameworks. The corresponding build scripts are located at the following paths:
Note
The Dockerfiles will be published at the same time as the Atlas community and Atlas image registry releases. They have not been updated yet.
RecSDK 26.1.0
| Tag | Dockerfile |
|---|---|
| 26.1.0-ubuntu20.04-tf-py3.7 | Dockerfile |
| 26.1.0-ubuntu22.04-pt-py3.11 | Dockerfile |
| 26.1.0-openEuler22.03-tf-py3.7 | Dockerfile |
| 26.1.0-openEuler22.03-pt-py3.11 | Dockerfile |
Quick Start
Running an Existing Image
If you have already pulled or built a usable image, you can quickly start and enter a shell environment with the following command:
docker run -it \
--name {container_name} \
--net=host \
-m 300g \
-e ASCEND_VISIBLE_DEVICES=0-7 \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v {mount_dir}:{mount_dir} \
{image_name}:{image_tag} \
/bin/bash
Parameter descriptions:
- -e ASCEND_VISIBLE_DEVICES=0-7: Makes NPU cards 0-7 visible in the container (the example assumes a machine with 8 cards). For 16 cards, set it to 0-15. Adjust based on your actual configuration.
- -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro: Mounts the host driver directory into the container read-only. Change the path to match your actual driver installation path.
- -v /etc/ascend_install.info:/etc/ascend_install.info: Mounts the driver/firmware installation info file, which the container needs.
- -m 300g: Sets the memory available to the container. Configure it based on your usage.
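The device range follows directly from the card count, as the 8-card and 16-card examples show. The sketch below derives it; ascend_device_range is a hypothetical helper name, and returning a bare 0 for a single card is an assumption rather than documented behavior.

```shell
# Hypothetical helper: turn a card count into the range string
# passed to ASCEND_VISIBLE_DEVICES (cards are numbered from 0).
ascend_device_range() {
    count=$1
    case "$count" in
        ''|*[!0-9]*|0)
            echo "card count must be a positive integer" >&2
            return 1 ;;
        1)  # Assumption: a single card is written as a bare index.
            echo 0
            return 0 ;;
    esac
    echo "0-$((count - 1))"
}

ascend_device_range 8     # 8 cards  -> 0-7
ascend_device_range 16    # 16 cards -> 0-15
```

The result could then be passed to the container as -e ASCEND_VISIBLE_DEVICES="$(ascend_device_range 16)".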
Building Images Locally
Developers can build business images locally in the current system environment. Before building, make sure the required packaged artifacts (e.g., packages/tf_rec_v1_*.tar.gz) exist in the outer directory, because the Dockerfile uses COPY to include them.
Build a TensorFlow-based image:
docker build -t recsdk_tf:26.1.0-ubuntu20.04-tf-py3.7 -f docker/Dockerfile.26.1.0-ubuntu20.04-tf-py3.7 .
If you need to target a specific core type (default is a2) when compiling operators or loading framework packages, you can explicitly specify it via the CORE_TYPE build argument:
docker build --build-arg CORE_TYPE=a2 -t recsdk_tf_a2:26.1.0-ubuntu20.04-tf-py3.7 -f docker/Dockerfile.26.1.0-ubuntu20.04-tf-py3.7 .
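When building for several core types, it can help to generate the command rather than retype it. The following is a dry-run sketch that only prints the command. print_build_cmd is a hypothetical name, and accepting a3 and a5 as CORE_TYPE values is an assumption based on the chip targets the image ships toolkits for; only a2 is confirmed above.

```shell
# Hypothetical dry-run helper: print (do not execute) the docker build
# command for a given core type.
print_build_cmd() {
    core_type=${1:-a2}                  # a2 is the documented default
    tag="26.1.0-ubuntu20.04-tf-py3.7"
    case "$core_type" in
        a2|a3|a5) ;;                    # a3/a5 are assumed, not confirmed
        *)  echo "unsupported CORE_TYPE: $core_type" >&2
            return 1 ;;
    esac
    echo "docker build --build-arg CORE_TYPE=${core_type}" \
         "-t recsdk_tf_${core_type}:${tag}" \
         "-f docker/Dockerfile.${tag} ."
}

print_build_cmd a2
```

Printing first makes it easy to review the image name and Dockerfile path before running the build.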
Secondary Development and Underlying Environment Switching
The image provides convenient environment switching scripts for development and debugging work that requires switching between different underlying dependency frameworks or target chip types:
- How to perform secondary development:
    # Use recsdk_tf:26.1.0-ubuntu20.04-tf-py3.7 as the base image and layer user software on top
    FROM recsdk_tf:26.1.0-ubuntu20.04-tf-py3.7
    RUN apt update -y && \
        apt install ...
- Switch Python framework environment (TensorFlow container):
    # Activate the RecSDK development environment for TF 1.15.0
    source /opt/buildtools/tf1_env/bin/activate
    # Exit the environment
    deactivate tf1_env
    # Activate the RecSDK development environment for TF 2.6.5
    source /opt/buildtools/tf2_env/bin/activate
    # Exit the environment
    deactivate tf2_env
- Switch Python framework environment (PyTorch container):
    # Activate the RecSDK development environment for torch_rec_v1 PT 2.6.0
    source /opt/buildtools/torch_v1_pt2.6.0/bin/activate
    # Exit the environment
    deactivate torch_v1_pt2.6.0
    # Activate the RecSDK development environment for torch_rec_v1 PT 2.7.1
    source /opt/buildtools/torch_v1_pt2.7.1/bin/activate
    # Exit the environment
    deactivate torch_v1_pt2.7.1
    # Activate the RecSDK development environment for torch_rec_v2 PT 2.7.1
    source /opt/buildtools/torch_v2_pt2.7.1/bin/activate
    # Exit the environment
    deactivate torch_v2_pt2.7.1
- Switch CANN hardware support packages:
    Use the built-in switching script to dynamically adjust the environment variables and symlink paths bound to CANN packages:
    source /usr/local/set_cann_env.sh a2 # Switch to Atlas 800T2 training server toolkit and related env vars (system default)
    source /usr/local/set_cann_env.sh a3 # Switch to Atlas 800T3 super node server toolkit and related env vars
    source /usr/local/set_cann_env.sh a5 # Switch to Ascend 950 toolkit and related env vars
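The two switches are usually combined: select the CANN package first, then activate the matching virtual environment. The sketch below wraps the activate paths listed above in a lookup. The function name and the short names (tf1, tf2, and so on) are hypothetical; only the paths come from this document.

```shell
# Hypothetical helper: map a short name to one of the activate scripts
# listed in this document.
recsdk_activate_script() {
    case "$1" in
        tf1)         echo /opt/buildtools/tf1_env/bin/activate ;;
        tf2)         echo /opt/buildtools/tf2_env/bin/activate ;;
        pt2.6.0-v1)  echo /opt/buildtools/torch_v1_pt2.6.0/bin/activate ;;
        pt2.7.1-v1)  echo /opt/buildtools/torch_v1_pt2.7.1/bin/activate ;;
        pt2.7.1-v2)  echo /opt/buildtools/torch_v2_pt2.7.1/bin/activate ;;
        *)  echo "unknown environment: $1" >&2
            return 1 ;;
    esac
}

script=$(recsdk_activate_script tf2)
if [ -f "$script" ]; then
    # "." is the POSIX spelling of "source".
    . /usr/local/set_cann_env.sh a2
    . "$script"
else
    echo "not inside a RecSDK image: $script not found"
fi
```

The existence check keeps the snippet harmless when it is run outside a RecSDK container.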
Support Information and Changelog
Hardware Support Information
- Automatic physical architecture adaptation: The Dockerfiles for all versions detect the host architecture natively. They handle both x86 and ARM hardware by reading ARCH=$(uname -m) early in the build, then branch on that value to fetch and assemble the matching OS-specific GCC configuration, system dynamic libraries, and framework distributions (i.e., different whl packages and run packages).
- Full chip compatibility deployment: The build is not tied to a single accelerator card or compute board. The compute stacks (ops packages and toolkit packages) for A2, A3, and A5 are all pre-installed in the image base, so the same image can be distributed to and used on any of these targets.
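The architecture branch described above can be pictured with a small case statement. This is only a sketch of the pattern, not the logic in the actual Dockerfiles: the function name arch_family and the labels x86 and arm are illustrative assumptions.

```shell
# Hypothetical helper: classify a machine string in the way an
# ARCH=$(uname -m) branch might. Defaults to the current host.
arch_family() {
    arch=${1:-$(uname -m)}
    case "$arch" in
        x86_64|amd64)   echo x86 ;;
        aarch64|arm64)  echo arm ;;
        *)  echo "unsupported architecture: $arch" >&2
            return 1 ;;
    esac
}

ARCH=$(uname -m)
echo "host $ARCH -> $(arch_family "$ARCH" || echo unknown)"
```

A build script could use the returned label to choose which whl and run packages to install.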
Compatibility Changes
- The previous practice of mixing several mutually exclusive framework modules under the global system Python has been abandoned, because it led to dependency conflicts (such as the common npu_bridge and npu_device collisions and Slim compatibility issues). The new scripts use python3.7 -m venv throughout, so each framework module is deployed in its own isolated virtual environment and cannot interfere with the others.
- Dependencies that are tightly coupled to hardware operations (including important runtime libraries such as GCC 11.2, CMake, OpenMPI, and Python) no longer come from the system package manager. Instead, they are built from source and installed with make install.
Quick Verification
TensorFlow Image Verification
Refer to the little demo for verification. Before running it, switch the CANN environment and the Python virtual environment as described in Secondary Development and Underlying Environment Switching.
PyTorch Image Verification
Refer to the little demo for verification. Before running it, switch the CANN environment and the Python virtual environment as described in Secondary Development and Underlying Environment Switching.
License / Disclaimer
- View the license information for RecSDK and Mind series software included in these images.
- As with all container images, pre-installed packages (Python, system libraries, etc.) may be subject to their own licenses.