
I installed gcc-7, gcc-8, gcc-7-offload-nvptx and gcc-8-offload-nvptx.

With both compilers I tried to compile a simple OpenMP code with offloading:

#include <omp.h>
#include <stdio.h>

int main(){
    #pragma omp target
    #pragma omp teams distribute parallel for
    for (int i=0; i<omp_get_num_threads(); i++)
        printf("%d in %d of %d\n",i,omp_get_thread_num(), omp_get_num_threads());
}

with the following command line (and analogously with gcc-7):

gcc-8 code.c -fopenmp -foffload=nvptx-none

But the build fails at link time with the following errors:

/tmp/ccKESWcF.o: In function "main":
teste.c:(.text+0x50): undefined reference to "GOMP_target_ext"
/tmp/cc0iOH1Y.target.o: In function "init":
ccPXyu6Y.c:(.text+0x1d): undefined reference to "GOMP_offload_register_ver"
/tmp/cc0iOH1Y.target.o: In function "fini":
ccPXyu6Y.c:(.text+0x41): undefined reference to "GOMP_offload_unregister_ver"
collect2: error: ld returned 1 exit status

Any clues?

648trindade
  • The code compiles fine in a clang distribution that I have – 648trindade Mar 15 '18 at 21:31
  • It seems that libgomp.so is too old. – Ilya Verbin Mar 16 '18 at 00:27
  • Your code compiles and runs on the CPU but fails on the GPU for me. Are you sure your Clang build is actually running on the GPU? I suspect it's actually running on the CPU. Have you checked with `sudo nvprof ./a.out` (assuming Nvidia) that the GPU is actually being used? (A minimal device check is sketched after these comments.) – Z boson Mar 16 '18 at 09:26
  • Yes, I looked. 1024 threads are generated and executed on the GPU, according to nvprof. nvprof results: https://pastebin.com/05Nv3m7v – 648trindade Mar 16 '18 at 17:41
  • @648trindade I don't understand how `printf` would work on the GPU. – Z boson Mar 19 '18 at 08:03
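
A minimal sketch for checking where a target region actually runs, using only the standard OpenMP device routines omp_get_num_devices and omp_is_initial_device (if the region silently falls back to the host, the second printf will say so):

#include <omp.h>
#include <stdio.h>

int main(void){
    int on_host = 1;
    printf("devices visible to the runtime: %d\n", omp_get_num_devices());

    /* omp_is_initial_device() returns non-zero when executed on the host */
    #pragma omp target map(tofrom: on_host)
    on_host = omp_is_initial_device();

    printf("target region ran on the %s\n", on_host ? "host (fallback)" : "device");
}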

1 Answer


Your code compiles and runs for me using -foffload=disable -fno-stack-protector with gcc-7 and gcc-7-offload-nvptx on Ubuntu 17.10.

But on the GPU (without -foffload=disable) it fails to compile. You can't call printf from the GPU. Instead you can do this:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(){
  int nthreads;

  /* query the number of threads in one team on the device */
  #pragma omp target teams map(tofrom:nthreads)
  #pragma omp parallel
  #pragma omp single
  nthreads = omp_get_num_threads();

  int *ithreads = malloc(sizeof *ithreads * nthreads);

  /* record each thread's id on the device, then print on the host */
  #pragma omp target teams distribute parallel for map(tofrom:ithreads[0:nthreads])
  for (int i=0; i<nthreads; i++) ithreads[i] = omp_get_thread_num();

  for (int i=0; i<nthreads; i++)
    printf("%d in %d of %d\n", i, ithreads[i], nthreads);

  free(ithreads);
}

For me this outputs

0 in 0 of 8
1 in 0 of 8
2 in 0 of 8
3 in 0 of 8
4 in 0 of 8
5 in 0 of 8
6 in 0 of 8
7 in 0 of 8
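
The "of 8" in every line is the default number of threads per team on this system. If you want to request the launch geometry explicitly, a sketch using the standard num_teams and thread_limit clauses looks like this (the runtime may still clamp the requested values):

#include <omp.h>
#include <stdio.h>

int main(void){
  int nteams = 0, nthreads = 0;

  /* request 4 teams with at most 8 threads each; the runtime may grant fewer */
  #pragma omp target teams num_teams(4) thread_limit(8) map(tofrom: nteams, nthreads)
  #pragma omp parallel
  #pragma omp single
  {
    nteams   = omp_get_num_teams();    /* teams actually launched */
    nthreads = omp_get_num_threads();  /* threads in this team's parallel region */
  }

  printf("%d teams, %d threads per team\n", nteams, nthreads);
}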
Z boson
  • This code doesn't compile on gcc-7 or gcc-8 either. But it compiles on clang :/ – 648trindade Mar 16 '18 at 17:47
  • However, if I set `-foffload=disable` I get the following error: https://pastebin.com/aSV8eSa5 So I suspected that my libgomp is outdated, as @ilya-verbin said. But `dpkg -l | grep libgomp` shows this version: https://pastebin.com/5WRzJcMS – 648trindade Mar 16 '18 at 17:51
  • Please check that `ldd a.out` points to libgomp.so version 8 (or 7 when you use gcc-7). – Ilya Verbin Mar 16 '18 at 22:33
  • @648trindade, OpenMP reports 8 threads in the target region on the GPU. I know that sounds strange. The CPU also reports 8 threads. `simd` makes a big difference on the GPU, so I suspect the threads in OpenMP target regions are not SIMT threads but more like warps. I have a GTX 1060. It has 20 SM units, each with 4 "cores", and each of those handles a warp of 32 SIMT threads -> 20*4*32 max. OpenMP defaults to 30 teams of 8 threads on my system. – Z boson Mar 19 '18 at 07:50
  • @Zboson I'm testing on a Titan X GPU, and the clang implementation returns 992 threads from `omp_get_num_threads` inside a target + parallel region. If I call `omp_get_num_threads` inside the parallel for and add the `teams distribute` directive, the number returned is 1024 – 648trindade Mar 21 '18 at 16:51
  • @IlyaVerbin I cannot run `ldd` because `gcc` doesn't even compile the code. The `clang` implementation links `libomp.so.5` (from the `libomp5` package) and the compiled `libomptarget.so` – 648trindade Mar 21 '18 at 16:58
  • And what is in `$LD_LIBRARY_PATH`? – Ilya Verbin Mar 21 '18 at 17:53
  • @IlyaVerbin Actually clang's `lib` dir was in `$LD_LIBRARY_PATH` (-‸ლ). I removed it and everything worked fine. Sorry for my dumbness. Should I remove the question? – 648trindade Mar 28 '18 at 04:14
  • @Zboson now gcc reports 8 threads for me too. The Titan X has 28 SM units with 128 "cores" per SM, i.e. 3584 CUDA cores. – 648trindade Mar 28 '18 at 04:47
  • gcc apparently has an upper bound of 72 teams (if I request more, omp falls back to 72), and `omp_get_max_threads` returns 8 inside a target region – 648trindade Mar 28 '18 at 05:14
  • @648trindade, I think it might be a bug in `omp_get_num_threads`. It's possibly not ready for `target`. In terms of speed, for the few tests I have done, OMP on the GPU is competitive with OpenACC, so the bug is only in querying the number of threads (see the sketch below). – Z boson Apr 09 '18 at 08:03
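
A small probe that separates a single flat parallel region on the device from the per-team thread count, as a sketch (the reported numbers depend on the compiler's nvptx runtime defaults):

#include <omp.h>
#include <stdio.h>

int main(void){
  int flat = 0, per_team = 0;

  /* one flat parallel region on the device, no teams construct */
  #pragma omp target parallel map(tofrom: flat)
  #pragma omp single
  flat = omp_get_num_threads();

  /* with teams, omp_get_num_threads() only counts the threads of the current team */
  #pragma omp target teams map(tofrom: per_team)
  #pragma omp parallel
  #pragma omp single
  per_team = omp_get_num_threads();

  printf("target parallel: %d threads, target teams: %d threads per team\n",
         flat, per_team);
}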