
I'm using GCC 9.3 on Ubuntu 20.04 and want to offload the famous SAXPY example to the GPU with OpenMP. I installed GCC's offloading support with `sudo apt install gcc-9-offload-nvptx`, then compiled the following code with `g++ -fopenmp main.cpp`:

#include <cstddef>

int main()
{
    const size_t kNumel = 999999;

    float x[kNumel];
    float y[kNumel];

    for (size_t i = 0; i < kNumel; i++)
    {
        x[i] = i;
        y[i] = i;
    }


    const float kCoef = 1.23f;

    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < kNumel; i++)
    {
        y[i] = kCoef*x[i] + y[i];
    }

    return 0;
}

BUT it doesn't compile and shows this error:

lto1: error: ‘-fcf-protection=full’ is not supported for this target
mkoffload: fatal error: x86_64-linux-gnu-accel-nvptx-none-gcc-9 returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /usr/lib/gcc/x86_64-linux-gnu/9//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

I added `-fno-stack-protector`, but the same error is reproduced.

  • Ubuntu and its default flags... The error message is pretty clear though, I don't know why you think that `-fno-stack-protector` is the way to disable `-fcf-protection=full`... – Marc Glisse Feb 11 '21 at 17:28
  • It was mentioned in other forums that `-fno-stack-protector` is required for Ubuntu 18.04, so I tried it. `-fcf-protection` accepts these values: [full, branch, return, check, none]. I tested all of them with no success :( When it is set to "none" or "check", the compiler throws `unresolved symbol __stack_chk_fail` – zana zakaryaie nejad Feb 12 '21 at 05:46
  • Managing to change the error message is progress, "none" seems good. Now to try and solve the next issue, the unresolved symbol ;-) – Marc Glisse Feb 12 '21 at 09:30
  • Thanks @MarcGlisse. I put both -fcf-protection=none and -fno-stack-protector. It compiles now. But the runtime is orders of magnitude slower than CPU version. I have to find good OpenMP clauses. – zana zakaryaie nejad Feb 12 '21 at 10:50
  • The GPU would chew through the data in a matter of microseconds. It is the time it takes to allocate data buffers on the GPU, copy the data from the host to the GPU, launch the kernel, wait for it to finish, and copy the data back from the GPU to the host, that makes the code orders of magnitude slower. You need to put MUCH more work in that loop. – Hristo Iliev Feb 12 '21 at 22:49
  • I got the point. Thank you @HristoIliev. – zana zakaryaie nejad Feb 13 '21 at 05:54

0 Answers