
I installed gcc-7, gcc-8, gcc-7-offload-nvptx and gcc-8-offload-nvptx.

With both compilers I tried to compile a simple OpenMP code with offloading:

#include <omp.h>
#include <stdio.h>

int main(){
    #pragma omp target
    #pragma omp teams distribute parallel for
    for (int i=0; i<omp_get_num_threads(); i++)
        printf("%d in %d of %d\n",i,omp_get_thread_num(), omp_get_num_threads());
}

with the following command line (and analogously with gcc-7):

gcc-8 code.c -fopenmp -foffload=nvptx-none

But the build fails at link time with the following errors:

/tmp/ccKESWcF.o: In function "main":
teste.c:(.text+0x50): undefined reference to "GOMP_target_ext"
/tmp/cc0iOH1Y.target.o: In function "init":
ccPXyu6Y.c:(.text+0x1d): undefined reference to "GOMP_offload_register_ver"
/tmp/cc0iOH1Y.target.o: In function "fini":
ccPXyu6Y.c:(.text+0x41): undefined reference to "GOMP_offload_unregister_ver"
collect2: error: ld returned 1 exit status

Any clues?

648trindade
  • The code compiles fine in a clang distribution that I have – 648trindade Mar 15 '18 at 21:31
  • It seems that libgomp.so is too old. – Ilya Verbin Mar 16 '18 at 00:27
  • Your code compiles and runs on the CPU but fails on the GPU for me. Are you sure your Clang build is actually running on the GPU? I suspect it's actually running on the CPU. Have you checked with `sudo nvprof ./a.out` (assuming Nvidia) that the GPU is actually being used? (A minimal device check is sketched after these comments.) – Z boson Mar 16 '18 at 09:26
  • Yes, I looked. 1024 threads are generated and executed on the GPU, according to nvprof. nvprof results: https://pastebin.com/05Nv3m7v – 648trindade Mar 16 '18 at 17:41
  • @648trindade I don't understand how `printf` would work on the GPU. – Z boson Mar 19 '18 at 08:03
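
A minimal sketch for checking where a target region actually runs, using only the standard OpenMP device routines omp_get_num_devices and omp_is_initial_device (if the region silently falls back to the host, the second printf will say so):

#include <omp.h>
#include <stdio.h>

int main(void){
    int on_host = 1;
    printf("devices visible to the runtime: %d\n", omp_get_num_devices());

    /* omp_is_initial_device() returns non-zero when executed on the host */
    #pragma omp target map(tofrom: on_host)
    on_host = omp_is_initial_device();

    printf("target region ran on the %s\n", on_host ? "host (fallback)" : "device");
}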

1 Answer


Your code compiles and runs for me using -foffload=disable -fno-stack-protector with gcc-7 and gcc-7-offload-nvptx on Ubuntu 17.10.

But on the GPU (without -foffload=disable) it fails to compile. You can't call printf from the GPU. Instead you can do this:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(){
  int nthreads;

  /* query the number of threads in one team on the device */
  #pragma omp target teams map(tofrom:nthreads)
  #pragma omp parallel
  #pragma omp single
  nthreads = omp_get_num_threads();

  int *ithreads = malloc(sizeof *ithreads * nthreads);

  /* record each thread's id on the device, then print on the host */
  #pragma omp target teams distribute parallel for map(tofrom:ithreads[0:nthreads])
  for (int i=0; i<nthreads; i++) ithreads[i] = omp_get_thread_num();

  for (int i=0; i<nthreads; i++)
    printf("%d in %d of %d\n", i, ithreads[i], nthreads);

  free(ithreads);
}

For me this outputs

0 in 0 of 8
1 in 0 of 8
2 in 0 of 8
3 in 0 of 8
4 in 0 of 8
5 in 0 of 8
6 in 0 of 8
7 in 0 of 8
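
The "of 8" in every line is the default number of threads per team on this system. If you want to request the launch geometry explicitly, a sketch using the standard num_teams and thread_limit clauses looks like this (the runtime may still clamp the requested values):

#include <omp.h>
#include <stdio.h>

int main(void){
  int nteams = 0, nthreads = 0;

  /* request 4 teams with at most 8 threads each; the runtime may grant fewer */
  #pragma omp target teams num_teams(4) thread_limit(8) map(tofrom: nteams, nthreads)
  #pragma omp parallel
  #pragma omp single
  {
    nteams   = omp_get_num_teams();    /* teams actually launched */
    nthreads = omp_get_num_threads();  /* threads in this team's parallel region */
  }

  printf("%d teams, %d threads per team\n", nteams, nthreads);
}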
Z boson
  • This code doesn't compile on gcc-7 or gcc-8 either. But it compiles on clang :/ – 648trindade Mar 16 '18 at 17:47
  • However, if I set `-foffload=disable` I get the following error: https://pastebin.com/aSV8eSa5 So I suspected that my libgomp is outdated, as @ilya-verbin said. But `dpkg -l | grep libgomp` shows this version: https://pastebin.com/5WRzJcMS – 648trindade Mar 16 '18 at 17:51
  • Please check that `ldd a.out` points to libgomp.so version 8 (or 7 when you use gcc-7). – Ilya Verbin Mar 16 '18 at 22:33
  • @648trindade, OpenMP reports 8 threads in the target region on the GPU. I know that sounds strange. The CPU also reports 8 threads. `simd` makes a big difference on the GPU, so I suspect the threads in OpenMP target regions are not SIMT threads but more like warps. I have a GTX 1060. It has 20 SM units, each with 4 "cores", and each of those handles a warp of 32 SIMT threads -> 20*4*32 max. OpenMP defaults to 30 teams of 8 threads on my system. – Z boson Mar 19 '18 at 07:50
  • @Zboson I'm testing on a Titan X GPU, and the clang implementation returns 992 threads from `omp_get_num_threads` inside a target + parallel region. If I call `omp_get_num_threads` inside the parallel for and add the `teams distribute` directive, the number returned is 1024 – 648trindade Mar 21 '18 at 16:51
  • @IlyaVerbin I cannot run `ldd` because `gcc` doesn't even compile the code. The `clang` implementation links `libomp.so.5` (from the `libomp5` package) and the compiled `libomptarget.so` – 648trindade Mar 21 '18 at 16:58
  • And what is in `$LD_LIBRARY_PATH`? – Ilya Verbin Mar 21 '18 at 17:53
  • @IlyaVerbin Actually clang's `lib` dir was in `$LD_LIBRARY_PATH` (-‸ლ). I removed it and everything worked fine. Sorry for my dumbness. Should I remove the question? – 648trindade Mar 28 '18 at 04:14
  • @Zboson now gcc reports 8 threads for me too. The Titan X has 28 SM units with 128 "cores" per SM, i.e. 3584 CUDA cores. – 648trindade Mar 28 '18 at 04:47
  • gcc apparently has an upper bound of 72 teams (if I request more, omp falls back to 72), and `omp_get_max_threads` returns 8 inside a target region – 648trindade Mar 28 '18 at 05:14
  • @648trindade, I think it might be a bug in `omp_get_num_threads`. It's possibly not ready for `target`. In terms of speed, for the few tests I have done, OMP on the GPU is competitive with OpenACC, so the bug is only in querying the number of threads (see the sketch below). – Z boson Apr 09 '18 at 08:03
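
A small probe that separates a single flat parallel region on the device from the per-team thread count, as a sketch (the reported numbers depend on the compiler's nvptx runtime defaults):

#include <omp.h>
#include <stdio.h>

int main(void){
  int flat = 0, per_team = 0;

  /* one flat parallel region on the device, no teams construct */
  #pragma omp target parallel map(tofrom: flat)
  #pragma omp single
  flat = omp_get_num_threads();

  /* with teams, omp_get_num_threads() only counts the threads of the current team */
  #pragma omp target teams map(tofrom: per_team)
  #pragma omp parallel
  #pragma omp single
  per_team = omp_get_num_threads();

  printf("target parallel: %d threads, target teams: %d threads per team\n",
         flat, per_team);
}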