
I'm trying to compile a simple vector addition example (daxpy) with OpenMP offloading to the GPU. When I compile the code with g++ v10.2.0, I see the following error:

daxpy_loop_target.cpp:37:6: error: function ‘plasma_core_omp_daxpy_loop_target’ \
    has been referenced in offloaded code \
    but hasn’t been marked to be included in the offloaded code

It seems I need to mark this function for offloading to the GPU. How do I do that?

Please find my C++ source code daxpy_loop_target.cpp below:

#include <omp.h>
#include <cstdio>

#include "daxpy_core.hpp"
#include "daxpy_loop_target.hpp"

void plasma_daxpy_loop_target(int n, int nb, double da, double *dx, int incx, double *dy, int incy) {

    plasma_omp_daxpy_loop_target(n, nb, da, dx, incx, dy, incy);
}

void plasma_omp_daxpy_loop_target(int n, int nb, double da, double *dx, int incx, double *dy, int incy) {

    #pragma omp target
    for (int i = 0; i < n; i += nb) {

        plasma_core_omp_daxpy_loop_target(nb, da, &dx[i], incx, &dy[i], incy);
    }
}

void plasma_core_omp_daxpy_loop_target(int n, double da, double *dx, int incx, double *dy, int incy) {

    plasma_core_daxpy(n, da, dx, incx, dy, incy);
}

Function plasma_core_daxpy is defined in daxpy_core.cpp as:

#include "mkl_cblas.h"
#include "daxpy_core.hpp"

void plasma_core_daxpy(int n, double da, double *dx, int incx, double *dy, int incy) {

    cblas_daxpy(n, da, dx, incx, dy, incy);
}

My compiler and compiler flags are the following:

CXX      = g++
CXXFLAGS = -fopenmp -foffload="-lm -latomic" -m64 ${INC}

Finally, the make output is the following:

g++ -fopenmp -foffload="-lm -latomic" -m64 -I"/lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/include" -c daxpy_util.cpp -o daxpy_util.o
g++ -fopenmp -foffload="-lm -latomic" -m64 -I"/lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/include" -c daxpy_core.cpp -o daxpy_core.o
g++ -fopenmp -foffload="-lm -latomic" -m64 -I"/lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/include" -c daxpy_loop_target.cpp -o daxpy_loop_target.o
g++ -fopenmp -foffload="-lm -latomic" -m64 -I"/lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/include" -c test_daxpy_loop_target.cpp -o test_daxpy_loop_target.o
g++ -fopenmp -foffload="-lm -latomic" -m64 -I"/lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/include" daxpy_util.o daxpy_core.o daxpy_loop_target.o test_daxpy_loop_target.o -Wl,-rpath,/lustre/system/local/apps/gcc10/10.2.0/bin -Wl,-rpath,/lustre/system/local/apps/gcc10/10.2.0/bin -Wl,--start-group /lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/lib/intel64/libmkl_intel_lp64.a /lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/lib/intel64/libmkl_sequential.a /lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl -o test_daxpy_loop_target
daxpy_loop_target.cpp:37:6: error: function ‘plasma_core_omp_daxpy_loop_target’ has been referenced in offloaded code but hasn’t been marked to be included in the offloaded code
   37 | void plasma_core_omp_daxpy_loop_target(int n, double da, double *dx, int incx, double *dy, int incy) {
      |      ^
lto1: fatal error: errors during merging of translation units
compilation terminated.
mkoffload: fatal error: x86_64-pc-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /netfs/smain01/system/local/apps/gcc10/10.2.0/bin/../libexec/gcc/x86_64-pc-linux-gnu/10.2.0//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/lustre/system/local/apps/gcc/utilities/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
make: *** [test_daxpy_loop_target] Error 1

Any hints or pointers would be greatly appreciated!

Update A

Other compilers provide less helpful error messages, e.g. clang++ v10.0.0 and nvc++ v21.3-0 complain about an undefined reference to plasma_core_daxpy:

clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_70 -std=c++17 -m64 -I"/lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/include" daxpy_util.o daxpy_core.o daxpy_loop_target.o test_daxpy_loop_target.o -Wl,-rpath,/lustre/system/local/apps/gcc10/10.2.0/bin -Wl,-rpath,/lustre/system/local/apps/gcc10/10.2.0/bin -Wl,--start-group /lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/lib/intel64/libmkl_intel_lp64.a /lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/lib/intel64/libmkl_sequential.a /lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl -o test_daxpy_loop_target
nvlink error   : Undefined reference to '_Z17plasma_core_daxpyidPdiS_i' in '/tmp/daxpy_loop_target-df645f.cubin'
clang-10: error: nvlink command failed with exit code 255 (use -v to see invocation)

nvc++ -mp=gpu -gpu=managed -m64 -I"/lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/include" daxpy_util.o daxpy_core.o daxpy_loop_target.o test_daxpy_loop_target.o -Wl,-rpath,/lustre/system/local/apps/gcc10/10.2.0/bin -Wl,-rpath,/lustre/system/local/apps/gcc10/10.2.0/bin -Wl,--start-group /lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/lib/intel64/libmkl_intel_lp64.a /lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/lib/intel64/libmkl_sequential.a /lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl -o test_daxpy_loop_target
nvlink error   : Undefined reference to '_Z17plasma_core_daxpyidPdiS_i' in 'daxpy_loop_target.o'
pgacclnk: child process exit status 2: /netfs/smain01/system/local/apps/nvidia-sdk/21.3/Linux_x86_64/21.3/compilers/bin/tools/nvdd
make: *** [test_daxpy_loop_target] Error 2

Update B

Assume I would like to implement my own version of daxpy for the GPU. I'm trying to declare it as a function to be offloaded with:

#pragma omp begin declare target
extern void plasma_core_daxpy(int, double, double*, int, double*, int);
#pragma omp end declare target
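
For comparison, here is a sketch of the paired spelling of the directive, which I could fall back to (my understanding is that this is the pre-5.1 form):

// Sketch only: paired declare target directives around the declaration,
// assumed to be the older (pre-OpenMP-5.1) spelling of the same construct.
#pragma omp declare target
extern void plasma_core_daxpy(int, double, double*, int, double*, int);
#pragma omp end declare target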

However, none of the compilers I use (g++, nvc++ and clang++) understands the begin declare target form. They fail with errors similar to the following (clang++ v10.0.0):

daxpy_loop_target.cpp:16:13: error: expected an OpenMP directive
#pragma omp begin declare target
        ^
daxpy_loop_target.cpp:18:25: error: unexpected OpenMP directive '#pragma omp end declare target'
#pragma omp end declare target

Can you please advise which compiler versions support the #pragma omp begin declare target / #pragma omp end declare target directives?
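
For context, this is roughly what I have in mind for the device-side daxpy itself. It is a sketch only: the plasma_core_daxpy_device name, the loop body, and the unit-stride data mapping are my own assumptions, not existing library code.

// Sketch of a hand-written device daxpy (dy = da*dx + dy), marked for
// device compilation with the paired declare target directives.
#pragma omp declare target
void plasma_core_daxpy_device(int n, double da,
                              double *dx, int incx,
                              double *dy, int incy) {
    for (int i = 0; i < n; ++i)
        dy[i * incy] += da * dx[i * incx];
}
#pragma omp end declare target

// Possible usage inside the target region. The map clauses assume
// incx == incy == 1, and the blocked loop assumes nb divides n,
// as in the original code.
void plasma_omp_daxpy_loop_target(int n, int nb, double da,
                                  double *dx, int incx,
                                  double *dy, int incy) {
    #pragma omp target map(to: dx[0:n]) map(tofrom: dy[0:n])
    for (int i = 0; i < n; i += nb) {
        plasma_core_daxpy_device(nb, da, &dx[i], incx, &dy[i], incy);
    }
}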

  • If you want to call a function from offloaded code, you need to have `#pragma omp declare target` to make the compiler generate a device version of that function. But since you are calling a BLAS function, you will need to rewrite the code to use cuBLAS instead of `cblas_daxpy`. – Michael Klemm Aug 24 '22 at 15:43
  • I see. If possible, can you please give a short example? I'm especially interested in the use of `#pragma omp declare target`. – mabalenk Aug 24 '22 at 15:57
  • See the OpenMP documentation: https://www.openmp.org/wp-content/uploads/openmp-examples-5.1.pdf . Note that GPU code cannot call CBLAS functions: they are meant to be executed on the CPU and are implemented in a separate library, so OpenMP cannot see the implementation and map it to the GPU. Even if it could, it would not be efficient anyway (CPUs and GPUs are quite different, so different algorithms are required to produce fast programs). Vendors like Nvidia provide a GPU-oriented BLAS for that (see cuBLAS). For generic code, you need to reimplement this. – Jérôme Richard Aug 24 '22 at 19:52
  • Thank you for your help! Please see my Update B above. – mabalenk Aug 25 '22 at 10:39
  • The documentation states it was "*introduced in OpenMP 5.1*", which is quite recent (and most compilers tend to implement new features slowly; see https://releases.llvm.org/14.0.0/tools/clang/docs/OpenMPSupport.html#id3 for Clang 14.0.0, which largely does not support it). In fact, up-to-date mainstream compilers barely support OpenMP 5.0 AFAIK (especially the `target`-related parts). – Jérôme Richard Aug 25 '22 at 18:48

0 Answers