[NVIDIA] Replace some NVGPU ops with equivalent NVVM ops (part 2) (#7471)

This change replaces some NVGPU ops with the corresponding NVVM ops. It
aligns with previous discussions in PR #7420.
For some ops, such as NVGPU::FenceAsyncSharedOp, there is no corresponding
intrinsic, so the LLVM path will also end up emitting PTX for them. In the
long run, however, I think it is better to hand responsibility for code
generation over to LLVM than to hard-code PTX at the NVGPU layer.
The ConvertNVVMToLLVMPass has been added to the pipeline and build
system so that NVVM ops are correctly lowered to LLVM IR.