I am using the modeling toolbox Anuga and have set it up to run with parallel support. To my current knowledge, the mechanism behind this is that NumPy is extended by C modules, which are exposed to OpenMP through
extra_args = ['-fopenmp']
I have developed and tested a script that runs through mpirun -np 4 python <myscript.py>
and it works. Since the models are getting bigger, my interest is in shifting some processing to an NVIDIA GPU through OpenMP; I have read that this is called offloading. I have installed a Quadro K2000:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K2000        Off  | 00000000:01:00.0  On |                  N/A |
| 32%   48C    P8    N/A /  N/A |    403MiB /  1999MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
So I installed gcc-offload-nvptx on my Ubuntu 19.04, which ships version 8 of GCC. I then altered the compiler flags to
extra_args = ['-fopenmp', '-fstack-protector']
and compiled the installation through python setup.py build. This returns the following message for the targeted module cg_ext.c, without any further error:
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fdebug-prefix-map=/build/python2.7-rzpqx3/python2.7-2.7.16=. -fstack-protector-strong -Wformat -Werror=format-security -Wl,-Bsymbolic-functions -Wl,-z,relro -Wdate-time -D_FORTIFY_SOURCE=2 -g -fdebug-prefix-map=/build/python2.7-rzpqx3/python2.7-2.7.16=. -fstack-protector-strong -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/anuga/utilities/cg_ext.o -Lbuild/temp.linux-x86_64-2.7 -o build/lib.linux-x86_64-2.7/anuga/utilities/cg_ext.so -fopenmp -fstack-protector
When I check the compiled library with ldd build/lib.linux-x86_64-2.7/anuga/utilities/cg_ext.so, I get
linux-vdso.so.1 (0x00007fff7a9fa000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f0650502000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0650317000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f0650311000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f06502f0000)
/lib64/ld-linux-x86-64.so.2 (0x00007f0650606000)
so I presume that everything has been set up correctly.
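That said, to verify the offload toolchain independently of Anuga, my understanding is that a minimal standalone test along these lines should report at least one device and run its target region on the GPU. This is only a sketch; the -foffload=nvptx-none compile flag is my assumption from the GCC documentation, not something Anuga's build uses:

/* offload_check.c -- standalone offload sanity test, independent of Anuga.
   Assumed build command: gcc -fopenmp -foffload=nvptx-none offload_check.c -o offload_check */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Number of offload devices libgomp can see; 0 would mean no offloading. */
    printf("offload devices: %d\n", omp_get_num_devices());

    int on_host = 1;
    #pragma omp target map(tofrom: on_host)
    {
        /* omp_is_initial_device() returns 0 when this block runs on the GPU. */
        on_host = omp_is_initial_device();
    }
    printf("target region ran on the %s\n", on_host ? "host" : "device");
    return 0;
}

If this were to report 0 devices, or the region stayed on the host, the toolchain rather than the Anuga build would be the first thing to fix.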
I now move on to altering the pragmas in one routine as follows:
before:
void cg_daxpy(int N, double a, double *x, double *y)
{
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++)
    {
        y[i] = y[i] + a * x[i];
    }
}
after:
void cg_daxpy(int N, double a, double *x, double *y)
{
    int i;
    #pragma omp target device(0)
    {
        #pragma omp parallel for
        for (i = 0; i < N; i++)
        {
            y[i] = y[i] + a * x[i];
        }
    }
}
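From my reading of the OpenMP spec, a device with separate memory also needs the arrays mapped explicitly, so a variant I would try next looks like this sketch. The combined target teams distribute parallel for construct and the map clauses are my assumptions; I have not verified them on this setup:

void cg_daxpy(int N, double a, double *x, double *y)
{
    int i;
    /* Sketch: map(to:) copies x to the device, map(tofrom:) copies y there and back. */
    #pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
    for (i = 0; i < N; i++)
    {
        y[i] = y[i] + a * x[i];
    }
}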
I then recompile and install, and run my script as follows, in the hope of getting profiling information:
nvprof --print-gpu-trace --profile-child-processes --profile-from-start off -fo %p.nvprof python -m cProfile runDamBreak.py
This returns the message:
==19444== Profiling application: orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess hnp -mca pmix ^s1,s2,cray,isolated
==19444== Profiling result:
No kernels were profiled.
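Since nvprof shows nothing, another debugging step I can think of is instrumenting the routine itself to report where the target region executes, roughly like this. Again a sketch: the map clauses and the stderr print are my additions, not part of the Anuga source:

#include <stdio.h>
#include <omp.h>

void cg_daxpy(int N, double a, double *x, double *y)
{
    int i;
    int on_host = 1;

    #pragma omp target map(tofrom: on_host) map(to: x[0:N]) map(tofrom: y[0:N])
    {
        /* 0 here would mean the block really ran on the GPU. */
        on_host = omp_is_initial_device();
        #pragma omp parallel for
        for (i = 0; i < N; i++)
        {
            y[i] = y[i] + a * x[i];
        }
    }

    fprintf(stderr, "cg_daxpy executed on the %s\n", on_host ? "host" : "device");
}

If this always printed "host", libgomp would be silently falling back to host execution.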
So, in conclusion, I understand that the pragmas are accepted by the compiler, but no code sections are actually sent to the GPU. Any hints on how to debug this further are greatly appreciated.
Best regards
Sebastian