X3D-S
Implements training of X3D-S on the Kinetics-400 dataset
Detail
Most of the code is adapted from here
The following changes were made to the source repository:
NPU & GPU
- Added some customized yaml configuration items, such as APEX.ENABLE, DIST_BACKEND, ...
- Ascend-Pytorch-1.5 does not support torch.nn.init.trunc_normal_, so torch.nn.init.normal_ is used instead
- Adjusted the order of dependency imports to prevent some unknown bugs (scikit-learn)
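The fallback from trunc_normal_ to normal_ described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the helper name pick_trunc_normal is hypothetical, and it takes the init module as an argument so the logic can be read without PyTorch installed.

```python
def pick_trunc_normal(init_module):
    """Return init_module.trunc_normal_ if it exists, else a normal_-based stand-in.

    `init_module` is expected to look like torch.nn.init (assumption).
    """
    native = getattr(init_module, "trunc_normal_", None)
    if native is not None:
        return native

    def fallback(tensor, mean=0.0, std=1.0, a=-2.0, b=2.0):
        # The truncation bounds a and b are ignored here:
        # normal_ does not clip the sampled values.
        return init_module.normal_(tensor, mean=mean, std=std)

    return fallback


# Typical use (requires PyTorch, shown as a comment only):
#   import torch
#   init_fn = pick_trunc_normal(torch.nn.init)
#   init_fn(layer.weight, std=0.02)
```

Note that the stand-in is not numerically identical to truncated-normal initialization, since samples outside the bounds are kept.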
NPU
- Ascend-Pytorch does not support grouped Conv3D, so all group operations in the model were removed
- Removed some irrelevant code to prevent some unknown bugs (Segmentation fault (core dumped))
Requirements
Base Environment
- Python == 3.7.5
- GCC >= 4.9
Python Environment
- Install these error-prone dependencies first:
  - PyTorch (raw==1.5 or ascend)
    - For Ascend-Pytorch, a version released after August 24 must be installed
  - torchvision == 0.6.0
    - If on CentOS arm, please build the source code from here
  - PyAV
    - If the installation fails on CentOS arm, follow this issue
  - Detectron2
    - Build from source code according to your CUDA version and PyTorch version
- Then, you can use pip3 install -r requirements.txt to install some simple dependencies
- Building source code
cd X3D # into source code root
# Switch to your prepared environment
python3 setup.py build develop # build slowfast and install the remaining dependencies
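To confirm that the error-prone dependencies listed above are importable in the prepared environment, a small check such as the following may help. It is only a sketch: the import names torch, torchvision, av (PyAV), and detectron2 are the usual module names for these packages, and the helper missing_modules is introduced here for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names for PyTorch, torchvision, PyAV, and Detectron2.
    required = ["torch", "torchvision", "av", "detectron2"]
    missing = missing_modules(required)
    print("missing:", ", ".join(missing) if missing else "none")
```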
Modify Ascend-toolkit
cd /usr/local
find / -name fractal_z_3d_2_ncdhw.py
vi path_to_fractal_z_3d_2_ncdhw.py
Locate the method:
1. def fractal_z_3d_2_ncdhw(src, dst, src_format, dst_format, kernel_name="fractal_z_3d_2_ncdhw")
2. Modify it according to this picture:
2.1. Remove `if list(dst_shape) in ....`
2.2. Align the next two lines like this
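If the find command is inconvenient, the file can also be located from Python. The sketch below assumes the toolkit is installed somewhere under /usr/local, as the commands above suggest. The helper name find_file is hypothetical.

```python
import os


def find_file(root, filename):
    """Return sorted full paths of every file named `filename` under `root`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


if __name__ == "__main__":
    # Assumption: the Ascend toolkit lives under /usr/local.
    for path in find_file("/usr/local", "fractal_z_3d_2_ncdhw.py"):
        print(path)
```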

Dataset
- Download the Kinetics-400 dataset from here
- Unzip all the packages and merge all folders
- We get two sets: the train set (used for training) and the val set (used for testing). Each of them has 400 folders
# format of data folder
|-data
    |-train
        |- video type 1
            |- video 1
            |- video 2
            ...
        |- video type 2
        ...
        |- video type 400
    |-val
        |- video type 1
        |- video type 2
        ...
        |- video type 400
- Build train.csv, val.csv, and test.csv, and put them in the same folder
# format of data path folder
|- data_path
    |- train.csv
    |- val.csv
    |- test.csv
train.csv consists of the train set
val.csv is the same as test.csv, and both consist of the val set (used for testing)
# format of data path csv is:
path_to_video_1 label_1
path_to_video_2 label_2
...
path_to_video_N label_N
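One way to generate these csv files from the folder layout above is sketched below. It rests on assumptions not stated in the original: labels are taken as the index of each class folder in sorted order, the separator is a single space, and the helper name build_split_csv is hypothetical.

```python
import os


def build_split_csv(split_dir, csv_path, sep=" "):
    """Write one 'video_path<sep>label' line per video found under split_dir.

    Assumption: the label is the index of the class folder in sorted order.
    Since train and val contain the same 400 folders, both splits get the
    same class-to-label mapping.
    """
    classes = sorted(
        d for d in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, d))
    )
    rows = []
    for label, cls in enumerate(classes):
        cls_dir = os.path.join(split_dir, cls)
        for video in sorted(os.listdir(cls_dir)):
            rows.append(f"{os.path.join(cls_dir, video)}{sep}{label}")
    with open(csv_path, "w") as f:
        f.write("\n".join(rows) + ("\n" if rows else ""))
    return len(rows)


# Example (paths are placeholders):
#   build_split_csv("data/train", "data_path/train.csv")
#   build_split_csv("data/val", "data_path/val.csv")
#   build_split_csv("data/val", "data_path/test.csv")
```

If class folder names contain spaces, a space separator makes the rows ambiguous to naive parsers, so a different separator may be needed if the loader supports one.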
- Check whether all videos are intact using the scripts provided by the mmaction2 project. Here, we provide the list of corrupted videos that have been identified
- Remove those corrupted videos from the three csv files
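Removing the corrupted videos from a csv can be done with a few lines of Python. This is a sketch assuming the space-separated "path label" format shown above and a corrupted list holding the same path strings that appear in the csv. The helper name drop_corrupted is hypothetical.

```python
def drop_corrupted(csv_path, corrupted, sep=" "):
    """Rewrite csv_path in place without rows whose video path is in `corrupted`.

    Returns the number of rows removed.
    """
    corrupted = set(corrupted)
    with open(csv_path) as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    # The label is the last field, so split from the right; this keeps
    # paths that themselves contain the separator intact.
    kept = [line for line in lines if line.rsplit(sep, 1)[0] not in corrupted]
    with open(csv_path, "w") as f:
        f.writelines(line + "\n" for line in kept)
    return len(lines) - len(kept)


# Example (file names are placeholders):
#   bad = [line.strip() for line in open("corrupted_videos.txt")]
#   for name in ("train.csv", "val.csv", "test.csv"):
#       drop_corrupted(f"data_path/{name}", bad)
```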
Training
To train a model, run the scripts under ./test with the path to the Kinetics-400 csv folder:
Note: real_data_path is the path of the csv folder mentioned above
# training 1p (300 epoch)
bash ./test/train_full_1p.sh --data_path=real_data_path
# training 8p (300 epoch)
bash ./test/train_full_8p.sh --data_path=real_data_path
# training performance 1p (1 epoch)
bash ./test/train_performance_1p.sh --data_path=real_data_path
# training performance 8p (3 epoch)
bash ./test/train_performance_8p.sh --data_path=real_data_path
# testing 8p
bash test/train_eval_8p.sh --data_path=real_data_path --pth_path=real_pre_train_model_path
# finetuning 1p
bash test/train_finetune_1p.sh --data_path=real_data_path --pth_path=real_pre_train_model_path --num_classes=default_400
Log path: ./stdout.log
Training Result
Because a full training run takes a long time, we trained on the full dataset for as many epochs as possible and checked that the NPU accuracy is aligned with the GPU accuracy.
| Device | FPS | Top1-ACC 10-view | Batch Size | Epochs | AMP |
|---|---|---|---|---|---|
| 1P-GPU | 10.39 | 6.67% | 96 | 1/300 | O2-128.0 |
| 1P-NPU | 5.38 | 6.18% | 96 | 1/300 | O2-128.0 |
| 1P-NPU-whitelist | 5.35 | 6.36% | 96 | 1/300 | O2-128.0 |
| 8P-GPU | 1137.49 | 37.56% | 256 | 30/300 | O2-128.0 |
| 8P-NPU | 529.24 | 39.67% | 256 | 30/300 | O2-128.0 |
| 8P-NPU-fusedSGD | 510.66 | 5.80% | 256 | 2/300 | O2-128.0 |
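The relative throughput and accuracy gap between the two 30-epoch 8P rows of the table can be worked out directly. The snippet below only restates numbers from the table; the variable names are illustrative.

```python
# Values copied from the 8P-GPU and 8P-NPU rows of the table above.
fps = {"8P-GPU": 1137.49, "8P-NPU": 529.24}
top1 = {"8P-GPU": 37.56, "8P-NPU": 39.67}

fps_ratio = fps["8P-NPU"] / fps["8P-GPU"]
acc_gap = top1["8P-NPU"] - top1["8P-GPU"]

print(f"NPU/GPU FPS ratio: {fps_ratio:.1%}")
print(f"Top1-ACC gap (NPU - GPU): {acc_gap:+.2f} points")
```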
- Testing result: Top1-ACC of 8P-NPU and 8P-GPU training (30 epochs)

Performance Optimization
From the results above, the accuracy (Top1-ACC 10-view) of 8P-NPU and 8P-GPU is close (39.67% vs. 37.56% after 30 epochs), but the throughput (FPS) of 8P-NPU is a little under half of 8P-GPU's (529.24 vs. 1137.49, about 47%).
So we made the following analysis and improvement:
- Find the dynamic operators following here. However, the operators found are very basic, and we cannot trace them back to specific parts of our large model.

- Check the NPU profile through chrome tracing
- To improve the low performance of Transpose, we first generate the cann profiling following here, then extract two operators, TransposeD and TransData.
  - If TransposeD Consuming time > 10s, add its Input Shapes to the White List (/usr/local/Ascend/ascend-toolkit/5.0.2/x86_64-linux/opp/op_impl/built-in/ai_core/tbe/impl/dynamic/transpose.py)
  - If TransData Consuming time > 10s & Input Formats == 'NCHW' & Output Formats == 'NC1HWC0', add its Input Shapes to the White List (/usr/local/Ascend/ascend-toolkit/5.0.2/x86_64-linux/opp/op_impl/built-in/ai_core/tbe/impl/four_2_five.py)
  - If TransData Consuming time > 10s & Input Formats == 'NC1HWC0' & Output Formats == 'NCHW', add its Input Shapes to the White List (/usr/local/Ascend/ascend-toolkit/5.0.2/x86_64-linux/opp/op_impl/built-in/ai_core/tbe/impl/five_2_four.py)
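The selection rules above can be expressed as a small filter over the profiling rows. This is a sketch under assumptions: the row keys (op, time, input_format, output_format, input_shape) are hypothetical names for the corresponding profiling columns, and the threshold uses the same unit as the rule above.

```python
IMPL = "/usr/local/Ascend/ascend-toolkit/5.0.2/x86_64-linux/opp/op_impl/built-in/ai_core/tbe/impl"

# White-list files named in the rules above.
WHITE_LISTS = {
    "transpose": f"{IMPL}/dynamic/transpose.py",
    "four_2_five": f"{IMPL}/four_2_five.py",
    "five_2_four": f"{IMPL}/five_2_four.py",
}


def white_list_target(row, min_time=10.0):
    """Return the WHITE_LISTS key a profiling row belongs to, or None."""
    if row["time"] <= min_time:
        return None
    if row["op"] == "TransposeD":
        return "transpose"
    if row["op"] == "TransData":
        formats = (row["input_format"], row["output_format"])
        if formats == ("NCHW", "NC1HWC0"):
            return "four_2_five"
        if formats == ("NC1HWC0", "NCHW"):
            return "five_2_four"
    return None


def collect_shapes(rows, min_time=10.0):
    """Group the distinct input shapes of slow operators by white-list key."""
    result = {}
    for row in rows:
        key = white_list_target(row, min_time)
        if key is not None:
            shapes = result.setdefault(key, [])
            if row["input_shape"] not in shapes:
                shapes.append(row["input_shape"])
    return result
```

The returned shapes still have to be added by hand to the white list inside each file. The sketch only decides which file each shape belongs to.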
After Optimization

Other
Issues and PRs about this project
- invalid gradient https://gitcode.com/ascend/modelzoo/issues/I452ZB https://gitcode.com/ascend/pytorch-develop/pulls/2438
- optimizer error https://gitcode.com/ascend/pytorch-develop/pulls/2438
- pyav install on CentOS arm https://gitcode.com/ascend/modelzoo/issues/I48AP3
- scikit-learn cannot allocate memory in static TLS https://gitcode.com/ascend/modelzoo/issues/I48QNY
Statement
For details about the public addresses used by the code in this repository, see the file public_address_statement.md