Transformed generic-mode kernel to SPMD-mode [OMP120]
=====================================================

This optimization remark indicates that the execution strategy for an OpenMP target offloading kernel was changed. A generic-mode kernel is executed by a single thread that schedules parallel worker threads using a state machine. This transformation moves a kernel that was initially generated in generic-mode to SPMD-mode, where all threads are active at the same time and no state machine is needed. SPMD-mode is closer to how threads actually execute on a GPU target. The transformation is only possible if the instructions previously executed by the single thread either have no side effects or can be guarded. Instructions without side effects are simply recomputed by every thread, while guarded instructions are still executed by only one thread.
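
As a minimal sketch of the side-effect-free case, consider the hypothetical kernel below. The name ``scale_rows`` and its parameters are illustrative and not part of the original example. The sequential part of each team only reads memory and writes a local variable, so every thread could recompute it once the kernel runs in SPMD-mode. Whether the remark is actually emitted for this code depends on the compiler version and flags.

```cpp
// Hypothetical example: scale row i of the M x N matrix A by 2 * factors[i].
void scale_rows(int M, int N, const double *factors, double *A) {
#pragma omp target teams distribute \
  map(to:factors[0:M]) map(tofrom:A[0:M*N])
  for (int i = 0; i < M; i++) {
    // Sequential part of the team. It reads factors[i] and writes only a
    // local variable, so it has no side effects that would need guarding.
    double f = 2.0 * factors[i];

    // f is firstprivate, so each worker thread gets its own copy.
#pragma omp parallel for firstprivate(f)
    for (int j = 0; j < N; j++)
      A[i*N + j] *= f;
  }
}
```

Calling ``scale_rows(2, 3, factors, A)`` with ``factors = {1.0, 0.5}`` doubles the first row of ``A`` and leaves the second row unchanged. The code also compiles and runs serially without OpenMP support, because the pragmas are then ignored.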

Generic-mode is often considerably slower than SPMD-mode because of the extra overhead required to separately schedule worker threads and pass data between them. This optimization allows users to write code with generic-mode semantics while achieving the performance of SPMD-mode. This can be helpful when defining shared memory between the threads using :ref:`OMP111 <omp111>`.
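
For comparison, the sketch below writes the same element-wise addition in two ways. The names ``add_combined`` and ``add_split`` are hypothetical. In the combined form there is no sequential code between the target and parallel regions, so the kernel should not need the generic-mode state machine in the first place. The split form follows generic-mode semantics and is the kind of kernel this transformation is meant for.

```cpp
// Combined construct: all threads of all teams share the collapsed loop nest.
void add_combined(int M, int N, const double *A, const double *B, double *C) {
#pragma omp target teams distribute parallel for collapse(2) \
  map(to:A[0:M*N]) map(to:B[0:M*N]) map(from:C[0:M*N])
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      C[i*N + j] = A[i*N + j] + B[i*N + j];
}

// Split constructs: rows are distributed over the teams, and the columns of
// each row over the threads of one team. There is no code between the two
// regions here, so nothing has to be guarded or recomputed.
void add_split(int M, int N, const double *A, const double *B, double *C) {
#pragma omp target teams distribute \
  map(to:A[0:M*N]) map(to:B[0:M*N]) map(from:C[0:M*N])
  for (int i = 0; i < M; i++) {
#pragma omp parallel for
    for (int j = 0; j < N; j++)
      C[i*N + j] = A[i*N + j] + B[i*N + j];
  }
}
```

Both functions compute the same result. The split form gives more direct control over how the work is distributed between teams and threads, which is the flexibility the paragraph above refers to.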

Examples
--------

Normally, any kernel that contains split OpenMP target and parallel regions will be executed in generic-mode. Sometimes it is easier to use generic-mode semantics to define shared memory, or to more tightly control the distribution of the threads. The following example shows a naive matrix-matrix multiplication that contains code that will need to be guarded.

.. code-block:: c++

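  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  #pragma omp target teams distribute collapse(2) \
    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum = 0.0;

  #pragma omp parallel for reduction(+:sum) default(firstprivate)
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];

        C[i*N + j] = sum;
      }
    }
  }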

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-version=51 -O2 -Rpass=openmp-opt omp120.cpp
  omp120.cpp:6:14: remark: Replaced globalized variable with 8 bytes of shared memory. [OMP111]
        double sum = 0.0;
               ^
  omp120.cpp:2:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
  #pragma omp target teams distribute collapse(2) \
  ^

This requires guarding the store to the shared variable ``sum`` and the store to the matrix ``C``. This can be thought of as generating the code below.

.. code-block:: c++

  #include <omp.h> // for omp_get_thread_num()

  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  #pragma omp target teams distribute collapse(2) \
    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum;
  #pragma omp parallel default(firstprivate) shared(sum)
        {
          // Guarded store to the shared variable: only one thread writes it.
  #pragma omp barrier
          if (omp_get_thread_num() == 0)
            sum = 0.0;
  #pragma omp barrier

  #pragma omp for reduction(+:sum)
          for (int k = 0; k < K; k++)
            sum += A[i*K + k] * B[k*N + j];

          // Guarded store to the result matrix.
  #pragma omp barrier
          if (omp_get_thread_num() == 0)
            C[i*N + j] = sum;
  #pragma omp barrier
        }
      }
    }
  }

Diagnostic Scope
----------------

OpenMP target offloading optimization remark.