dd6fe347创建于 4月9日历史提交

X3D-S

Implements training of X3D-S on the Kinetics-400 dataset

Detail

Most of codes are modified according to here

There are some special modification of source repository :

NPU & GPU

Add some customized yaml configuration items, such as APEX.ENABLE、DIST_BACKEND...
Ascend-Pytorch-1.5 is not supported torch.nn.init.trunc_normal , using torch.nn.init.normal_ instead
Adjusted the order of dependency import to prevent some unknown bugs (scikit-learn)

NPU

Group conv3D of Ascend-Pytorch is not supported, so we canceled all group operations in the model
Remove some irrelevant codes to prevent some unknown bugs (Segmentation fault (core dumped))

Requirements

Base Environment

Python == 3.7.5
GCC >= 4.9

Python Environment

Installing these error-prone dependencies first:

PyTorch （raw==1.5 or ascend）
- Ascend-Pytorch Version after August 24 be installed
torchvision == 0.6.0
- If on Centos arm, please build the source code from here
PyAV
- If the installation fails on Centos arm, following this issue
Detectron2
- According to the CUDA version and Pytorch version, build from source code

Then, you can use pip3 install -r requirements.txt to install some simple dependencies
Building source code

cd X3D  # into source code root

# Switch to your prepared environment

python3 setup.py build develop  # build slowfast and install the remaining dependencies

Modify Ascend-tookit

cd /usr/local
find / -name fractal_z_3d_2_ncdhw.py
vi path_to_fractal_z_3d_2_ncdhw.py

located method: 
    1. def fractal_z_3d_2_ncdhw(src, dst, src_format, dst_format,kernel_name="fractal_z_3d_2_ncdhw")
	2. modify it according this picture:
		2.1. remove `if list(dst_shape) in ....`
		2.2. Align the next two lines like this

Dataset

Download the Kinetics-400 dataset from here

unzip the all packages and merge all folders

we get two sets , train set (used to train) and val set (used to test). And each of both has 400 folders

# format of data folder

|-data
	 |-train
	 	|- video type 1
	 		|- video 1
	 		|- video 2
	 		...
	 	|- video type 2
	 	...
	 	|- video type 400
	 |-val
	 	|- video type 1
	 	|- video type 2
	 	...
	 	|- video type 400

build train.csv, val.csv, test.csv, and put them in the same folder

# format of data path folder
|- data_path
	|- train.csv
	|- val.csv
	|- test.csv

train.csv consists of train set

val.csv is same as test.csv, and consists of test set

# format of data path csv is:

path_to_video_1 label_1
path_to_video_2 label_2
...
path_to_video_N label_N

check if the all videos are lossless according to the scripts provided by project mmaction2 . Here, we provide the list of corrupted videos that have been checked out
remove the those corrupted videos from the three csv

Training

To train a model, run main.py with the desired model architecture and the path to the ImageNet dataset:

Note：the real_data_path is path of csv folder mentioned above

# training 1p (300 epoch)
bash ./test/train_full_1p.sh --data_path=real_data_path

# training 8p (300 epoch)
bash ./test/train_full_8p.sh --data_path=real_data_path

# training performance 1p (1 epoch)
bash ./test/train_performance_1p.sh --data_path=real_data_path

# training performance 8p (3 epoch)
bash ./test/train_performance_8p.sh --data_path=real_data_path

# testing 8p
bash test/train_eval_8p.sh --data_path=real_data_path --pth_path=real_pre_train_model_path

# train_finetune_1p.sh
bash test/train_finetune_1p.sh --data_path=real_data_path --pth_path=real_pre_train_model_path --num_classes=default_400

Log path: ./stdout.log

Training Result

Due to the calculation cast a long time, we choose to run the full data, and the NPU-ACC is aligned with the GPU-ACC (as many epochs as possible).

Device	FPS	Top1-ACC 10-view	Batch Size	Epochs	AMP
1P-GPU	10.39	6.67%	96	1/300	O2-128.0
1P-NPU	5.38	6.18%	96	1/300	O2-128.0
1P-NPU-白名单	5.35	6.36%	96	1/300	O2-128.0
8P-GPU	1137.49	37.56%	256	30/300	O2-128.0
8P-NPU	529.24	39.67%	256	30/300	O2-128.0
8P-NPU-fusedSGD	510.66	5.80%	256	2/300	O2-128.0

Testing result: Top1-ACC of 8P-NPU and 8P-GPU training (30 epochs)

![img](meta\8P-GPU & 8P-NPU.png)

Performance Optimization

According to the above, it can be concluded that the accuracy(Top1-ACC 10-view) of 8P-GPU and 8P-NPU is little different. But performance(FPS) of 8P-NPU is 50% of 8P-GPU's.

So we made the following analysis and improvement:

find the dynamic operators following here, but the operators is very basic, and we can not identify them from our big model.

check the profile of NPU through chrome tracing
In order to improve the low perfomance of Transpose, we first generate the cann profiling following here, then we extract the two operators, TransposeD and TransData.
- if TransposeD Consuming time > 10s, add its Input Shapes to White List (/usr/local/Ascend/ascend-toolkit/5.0.2/x86_64-linux/opp/op_impl/built-in/ai_core/tbe/impl/dynamic/transpose.py）
- if TransData Consuming time > 10s & Input Formats == 'NCHW' & Output Formats == 'NC1HWC0', add its Input Shapes to White List (/usr/local/Ascend/ascend-toolkit/5.0.2/x86_64-linux/opp/op_impl/built-in/ai_core/tbe/impl/four_2_five.py）
- if TransData Consuming time > 10s & Input Formats == 'NC1HWC0' & Output Formats == 'NCHW', add its Input Shapes to White List (/usr/local/Ascend/ascend-toolkit/5.0.2/x86_64-linux/opp/op_impl/built-in/ai_core/tbe/impl/five_2_four.py）

After Optimization

ELSE

Iessues and PRs about this project

invalid gradient https://gitcode.com/ascend/modelzoo/issues/I452ZB https://gitcode.com/ascend/pytorch-develop/pulls/2438
optimizer error https://gitcode.com/ascend/pytorch-develop/pulls/2438
pyav install on CentOS arm https://gitcode.com/ascend/modelzoo/issues/I48AP3
scikit-learn cannot allocate memory in static TLS https://gitcode.com/ascend/modelzoo/issues/I48QNY

Statement

For details about the public address of the code in this repository, you can get from the file public_address_statement.md