
I am using the modeling toolbox Anuga and have set it up to run with parallel support. To my current knowledge, the mechanism behind this is that NumPy is extended by C modules which are exposed to OpenMP through

extra_args = ['-fopenmp']

I have developed and tested a script that runs through mpirun -np 4 python <myscript.py> and it works. Since the models are getting bigger, my interest is to shift some of the processing to a physical NVIDIA GPU through OpenMP; I read that this is called offloading. I have installed a Quadro K2000, which nvidia-smi reports as

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K2000        Off  | 00000000:01:00.0  On |                  N/A |
| 32%   48C    P8    N/A /  N/A |    403MiB /  1999MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

So I

  1. installed gcc-offload-nvptx on my Ubuntu 19.04, which is based on version 8 of gcc. I then

  2. altered the compiler flags to

extra_args = ['-fopenmp', '-fstack-protector']

and

  3. compiled the installation through python setup.py build. For the targeted module cg_ext.c this prints the following link command without any further error:

x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fdebug-prefix-map=/build/python2.7-rzpqx3/python2.7-2.7.16=. -fstack-protector-strong -Wformat -Werror=format-security -Wl,-Bsymbolic-functions -Wl,-z,relro -Wdate-time -D_FORTIFY_SOURCE=2 -g -fdebug-prefix-map=/build/python2.7-rzpqx3/python2.7-2.7.16=. -fstack-protector-strong -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/anuga/utilities/cg_ext.o -Lbuild/temp.linux-x86_64-2.7 -o build/lib.linux-x86_64-2.7/anuga/utilities/cg_ext.so -fopenmp -fstack-protector

When I

  4. check the compiled library with ldd, I get

build/lib.linux-x86_64-2.7/anuga/utilities/cg_ext.so
        linux-vdso.so.1 (0x00007fff7a9fa000)
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f0650502000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0650317000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f0650311000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f06502f0000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f0650606000)

so I presume that everything has been set up correctly. I now move on to

  5. altering the pragma directives in one routine as follows:

before:

void cg_daxpy(int N, double a, double *x, double *y)
{
  int i;
  #pragma omp parallel for private(i)
  for(i=0;i<N;i++)
  {
    y[i]=y[i]+a*x[i];
  }
}

after:

void cg_daxpy(int N, double a, double *x, double *y)
{
  int i;
  #pragma omp target device(0)
  {
  #pragma omp parallel for
  for(i=0;i<N;i++)
  {
    y[i]=y[i]+a*x[i];
  }
  }
}
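
As far as I understand, a target region does not automatically copy the data behind plain pointers to the device, so explicit map clauses may be needed for x and y. The following is only a sketch of the variant I have in mind, assuming both arrays hold N doubles on the host:

void cg_daxpy(int N, double a, double *x, double *y)
{
  int i;
  /* map(to: ...) copies x to the device before the region runs,
     map(tofrom: ...) copies y there and back afterwards. */
  #pragma omp target teams distribute parallel for \
      map(to: x[0:N]) map(tofrom: y[0:N])
  for(i=0;i<N;i++)
  {
    y[i]=y[i]+a*x[i];
  }
}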

I then recompile and install, and run my script as follows in the hope of getting profiling information:

nvprof --print-gpu-trace --profile-child-processes --profile-from-start off -fo %p.nvprof python -m cProfile runDamBreak.py

This returns the message

==19444== Profiling application: orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess hnp -mca pmix ^s1,s2,cray,isolated
==19444== Profiling result:
No kernels were profiled.

So in conclusion the pragmas are understood by the compiler, but no regions are actually sent to the GPU. One further check I can think of is the standalone test sketched below; beyond that, any hints on how to debug further are greatly appreciated.
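
The test would be compiled outside of the Python build; the compile line gcc -fopenmp -foffload=nvptx-none check_offload.c -o check_offload is only my assumption of how to invoke the offload-enabled gcc, and the program simply reports whether libgomp sees the GPU and whether a target region actually runs there:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int on_device = 0;

  /* Number of offload devices libgomp can see at runtime. */
  printf("omp_get_num_devices() = %d\n", omp_get_num_devices());

  /* Run a trivial target region and record whether it actually
     executed on the device or silently fell back to the host. */
  #pragma omp target map(tofrom: on_device)
  {
    on_device = !omp_is_initial_device();
  }

  printf("target region ran on %s\n", on_device ? "the device" : "the host");
  return 0;
}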

Best regards

Sebastian


1 Answer


Most binary packages of gcc/llvm/clang have GPU offloading support disabled; you need to compile your own compiler to enable it.

As far as I can tell, gcc cannot produce shared libs with offloading, so you are probably stuck with llvm if you want to create a C-extension for Python.

However, I am facing a similar issue: I use llvm with offloading definitely enabled. Whenever I try to use an OpenMP target region within the Cython-generated code, it simply does not find any devices (omp_get_num_devices() returns 0). Running the exact same code in a plain C program works just fine, even when explicitly dlopen'ing a .so with the function calling omp_get_num_devices(). This is really strange.
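
To make the comparison concrete, the plain-C dlopen test looks roughly like this; liboffload_test.so and its exported function num_devices (which simply returns omp_get_num_devices()) are placeholders for my own test library:

#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
  /* liboffload_test.so is a placeholder name for a small shared
     library whose num_devices() returns omp_get_num_devices(). */
  void *handle = dlopen("./liboffload_test.so", RTLD_NOW);
  if (!handle) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }

  int (*num_devices)(void) = (int (*)(void)) dlsym(handle, "num_devices");
  if (!num_devices) {
    fprintf(stderr, "dlsym failed: %s\n", dlerror());
    dlclose(handle);
    return 1;
  }

  printf("devices seen from the dlopen'ed library: %d\n", num_devices());
  dlclose(handle);
  return 0;
}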

Frank