| [feature]Fix the problem of slow parsing speed of JSON files in big datasets
Co-authored-by: feng0w0<houyufeng4@huawei.com>
# message auto-generated for no-merge-commit merge:
!1944 merge master into master
[feature]Fix the problem of slow parsing speed of JSON files in big datasets
Created-by: feng0w0
Commit-by: feng0w0
Merged-by: ascend-robot
Description: ## Motivation
1. When processing large datasets, JSON parsing is very slow.
2. The parsed JSON data contains keys that are not used during training.
## Modification
1. Replace the library used to parse a single JSON file: pandas is replaced with orjson.
2. Use multiple processes to speed up parsing of multiple JSON files, and use shared memory to reduce data transfer time between processes. (Set in data.json: dataset_param.basic_parameters.use_multiprocess)
3. Retain only the specified keys while parsing JSON files. (Set in data.json: dataset_param.basic_parameters.reserved_keys)
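
The three changes above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names `parse_json_file` and `parse_json_files`, the `workers` parameter, and the assumption that each file holds a JSON list of objects are all hypothetical. The shared-memory transfer between processes is not shown; results here are returned through the executor's normal pickling path.

```python
import json
from concurrent.futures import ProcessPoolExecutor

try:
    import orjson  # third-party parser named in this PR
    _loads = orjson.loads
except ImportError:
    # Fallback so the sketch still runs where orjson is not installed.
    _loads = json.loads


def parse_json_file(path, reserved_keys=None):
    """Parse one JSON file (assumed to be a list of objects).

    If reserved_keys is given, every other key is dropped from each record.
    """
    with open(path, "rb") as f:
        records = _loads(f.read())
    if reserved_keys is None:
        return records
    keep = set(reserved_keys)
    return [{k: v for k, v in rec.items() if k in keep} for rec in records]


def parse_json_files(paths, reserved_keys=None, use_multiprocess=False, workers=4):
    """Parse several files, optionally one file per worker process."""
    if not use_multiprocess:
        return [parse_json_file(p, reserved_keys) for p in paths]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with paths.
        return list(pool.map(parse_json_file, paths, [reserved_keys] * len(paths)))
```

In this sketch, `use_multiprocess` and `reserved_keys` mirror the two options named under `dataset_param.basic_parameters`; how the real loader reads them from data.json is not shown in this description.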
## Self-test (Optional)
If modifications to this PR may cause/fix function/accuracy/performance DTSs/issues, a self-inspection record needs to be attached.
## BC-breaking (Optional)
If there are compatibility issues, such as dependencies on cann/torch_npu versions, they need to be explained in the PR.
## Checklist
**Before PR**:
- [ ] The new code needs to comply with the Clean Code specification.
- [ ] The PR content has been self-checked; the wording is clear and the writing is standardized.
**After PR**:
- [ ] CLA has been signed and all committers have signed the CLA in this PR.
- [ ] The CI pipeline has passed and Code Check has passed.
See merge request: Ascend/MindSpeed-MM!1944 | 4 months ago |