jihuazhong[PyTorch][dev][perf][mlperf-bert]修复run_pretraing.py中的异常字符

4b1a3324创建于 2023年6月30日历史提交

文件	最后提交记录	最后更新时间
cleanup_scripts	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
input_preprocessing	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
mhalib	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
model	!4689 [PyTorch][dev][perf]增加mlperf_bert适配代码 * 增加mlperf_bert适配代码	3 年前
.dockerignore	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
Dockerfile	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
LICENSE	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
NOTICE	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
README.md	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
bmm1.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
bmm2.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
config_DGXA100_16x8x24x1.sh	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
config_DGXA100_16x8x64x1.sh	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
config_DGXA100_2x8x16x1.sh	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
config_DGXA100_2x8x28x1.sh	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
config_DGXA100_2x8x48x1.sh	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
config_DGXA100_common.sh	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
convert_tf_checkpoint.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
create_pretraining_data.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
extract_features.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
file_utils.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
mha.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
mlperf_logger.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
modeling.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
optimization.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
padding.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
requirements.txt	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
run.sub	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
run_and_time.sh	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
run_pretraining.py	[PyTorch][dev][perf][mlperf-bert]修复run_pretraing.py中的异常字符	2 年前
schedulers.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
softmax.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
tokenization.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
train_1p_perf.sh	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
train_8p_acc.sh	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
train_8p_perf.sh	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前
utils.py	!4675 mlperf_bert源码移动到dev/perf/mlperf_bert/pytorch目录下 * mlperf_bert源码移动到mlperf_bert/pytorch目录下并回退新增的NVIDIA	3 年前

Location of the input files

This MLCommons members Google Drive location contains the following.

TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (which is actually 3 files).
Vocab file (vocab.txt) to map WordPiece to word id.
Config file (bert_config.json) which specifies the hyperparameters of the model.

Checkpoint conversion

python convert_tf_checkpoint.py --tf_checkpoint /cks/model.ckpt-28252 --bert_config_path /cks/bert_config.json --output_checkpoint model.ckpt-28252.pt

Download the preprocessed text dataset

From the MLCommons BERT Processed dataset directory download results_text.tar.gz, and bert_reference_results_text_md5.txt. Then perform the following steps:

tar xf results_text.tar.gz
cd results4
md5sum --check ../bert_reference_results_text_md5.txt
cd ..

Generate the BERT input dataset

The create_pretraining_data.py script duplicates the input plain text, replaces different sets of words with masks for each duplication, and serializes the output into the HDF5 file format.

Training data

The following shows how create_pretraining_data.py is called by a parallelized script that can be called as shown below. The script reads the text data from the results4/ subdirectory and outputs the resulting 500 hdf5 files to a subdirectory named "hdf5".

./parallel_create_hdf5.sh

Next we need to shard the data into 2048 chunks. This is done by calling the chop_hdf5_files.py script. This script reads the 500 hdf5 files from subdirectory hdf5/ and creates 2048 hdf5 files in subdirectory 2048_shards_uncompressed.

mkdir -p 2048_shards_uncompressed
python3 ./chop_hdf5_files.py

Evaluation data

Use the following steps for the eval set:

mkdir eval_set_uncompressed

python3 create_pretraining_data.py \
  --input_file=results4/eval.txt \
  --output_file=eval_all \
  --vocab_file=<path to vocab.txt> \
  --do_lower_case=True \
  --max_seq_length=512 \
  --max_predictions_per_seq=76 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=10

python3 pick_eval_samples.py \
  --input_hdf5_file=eval_all.hdf5 \
  --output_hdf5_file=eval_set_uncompressed/part_eval_10k.hdf5 \
  --num_examples_to_pick=10000

Running the model

Building the Docker container

docker build --pull -t <docker/registry>/mlperf-nvidia:language_model .
docker push <docker/registry>/mlperf-nvidia:language_model

To run this model, use the following command. Replace the paths in the config file(config_DGXA100_common.sh) to match paths on your system.

source <CONFIG>
sbatch --nodes ${DGXNNODES} --ntasks-per-node=${DGXNGPU} --time=${WALLTIME} run.sub

To run RCP specific configs:

GBS256=TPU-v4-16

source config_DGXA100_2x8x16x1.sh;
sbatch --nodes 2 --ntasks-per-node 8 --time 01:00:00 run.sub

GBS448=TPU-v4-16

source config_DGXA100_2x8x28x1.sh;
sbatch --nodes 2 --ntasks-per-node 8 --time 01:00:00 run.sub

GBS768=TPU-v4-16

source config_DGXA100_2x8x48x1.sh;
sbatch --nodes 2 --ntasks-per-node 8 --time 01:00:00 run.sub

GBS3072=TPU-v4-128

source config_DGXA100_16x8x24x1.sh;
sbatch --nodes 16 --ntasks-per-node 8 --time 01:00:00 run.sub

GBS8192=TPU-v4-128

source config_DGXA100_16x8x64x1.sh;
sbatch --nodes 16 --ntasks-per-node 8 --time 01:00:00 run.sub

Note: Make sure a current run doesn't use stale parameters sourced from a previous run.

For multi-node training, we use Slurm for scheduling and Pyxis to run our container.

The code comes with an option to either clip and accumulate gradients (--clip_and_accumulate), or accumulate gradients and then clip(which is the default mode). The benchmark implements gradient clipping before allreduce. To be identical to a 16x8x64x1 run with gradient accumulation for example, one can run either:

8x8x64x2 with --clip_and_accumulate
16x8x32x2 without --clip_and_accumulate

Configuration File Naming Convention

All configuration files follow the format config_<SYSTEM_NAME>_<NODES>x<GPUS/NODE>x<BATCH/GPU>x<GRADIENT_ACCUMULATION_STEPS>.sh.

Example 1

A DGX A100 system with 1 node, 8 GPUs per node, batch size of 6 per GPU, and 6 gradient accumulation steps would use config_DGXA100_1x8x6x6.sh.

Example 2

A DGX A100 system with 32 nodes, 8 GPUs per node, batch size of 20 per GPU, and no gradient accumulation would use config_DGXA100_32x8x20x1.sh

Description of how the `results_text.tar.gz` file was prepared

First download the wikipedia dump and extract the pages The wikipedia dump can be downloaded from this google drive, and should contain enwiki-20200101-pages-articles-multistream.xml.bz2 as well as the md5sum.
Run WikiExtractor.py, version e4abb4cb from March 29, 2020, to extract the wiki pages from the XML The generated wiki pages file will be stored as /LL/wiki_nn; for example /AA/wiki_00. Each file is ~1MB, and each sub directory has 100 files from wiki_00 to wiki_99, except the last sub directory. For the 20200101 dump, the last file is FE/wiki_17.
Clean up and dataset seperation. The clean up scripts (some references here) are in the scripts directory. The following command will run the clean up steps, and put the resulted trainingg and eval data in ./results ./process_wiki.sh 'text/*/wiki_??'
After running the process_wiki.sh script, for the 20200101 wiki dump, there will be 500 files, named part-00xxx-of-00500 in the ./results directory, together with eval.md5 and eval.txt.
Exact steps (starting in the bert path)

cd input_preprocessing
mkdir -p wiki  
cd wiki
# download enwiki-20200101-pages-articles-multistream.xml.bz2 from Google drive and check md5sum
bzip2 -d enwiki-20200101-pages-articles-multistream.xml.bz2  
cd ..    # back to bert/input_preprocessing  
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
git checkout e4abb4cbd
python3 wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml    # Results are placed in bert/input_preprocessing/text  
./process_wiki.sh './text/*/wiki_??'

After running the process_wiki.sh script, for the 20200101 wiki dump, there will be 500 files, named part-00xxx-of-00500 in the ./results directory.

MD5sums:

File	Size (bytes)	MD5
bert_config.json	314	7f59165e21b7d566db610ff6756c926b
vocab.txt	231,508	64800d5d8528ce344256daf115d4965e
model.ckpt-28252.index (tf1)	17,371	f97de3ae180eb8d479555c939d50d048
model.ckpt-28252.meta (tf1)	24,740,228	dbd16c731e8a8113bc08eeed0326b8e7
model.ckpt-28252.data-00000-of-00001 (tf1)	4,034,713,312	50797acd537880bfb5a7ade80d976129
model.ckpt-28252.index (tf2)	6,420	fc34dd7a54afc07f2d8e9d64471dc672
model.ckpt-28252.data-00000-of-00001 (tf2)	1,344,982,997	77d642b721cf590c740c762c7f476e04
enwiki-20200101-pages-articles-multistream.xml.bz2	17,751,214,669	00d47075e0f583fb7c0791fac1c57cb3
enwiki-20200101-pages-articles-multistream.xml	75,163,254,305	1021bd606cba24ffc4b93239f5a09c02