
I'm working on a project that requires OpenMP offloading to Nvidia GPUs using Clang. I was able to install Clang with offloading support by following the instructions mentioned here.

System specification

  • OS: Ubuntu 16.04 LTS
  • Clang version: 4.00
  • Processor: Intel(R) Core(TM) i7-4700MQ CPU
  • CUDA version: 9.0
  • Nvidia GPU: GeForce 740M (compute capability sm_35)

The problem is that when I execute a sample program to test OpenMP offloading to the Nvidia GPU, part of the target region appears to run on the GPU, and then the same target region starts executing on the host.

Here is the sample program, a small C program that multiplies two matrices.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

/* Problem size. */
# define N 1920

void init_array(float* A, float* B)
{
    int i, j;
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            A[i*N + j] = ((float) i*j) / N;
        }
    }

    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            B[i*N + j] = ((float) i*(j+1)) / N;
        }
    }
}
void mm_kernel(float *A, float *B, float *E)
{
    #pragma omp target data map(to:A) map(to:B) map(alloc:E)
    {
        #pragma omp target
        #pragma omp teams distribute num_teams(4)
        for (int i = 0; i < N; i++)
        {
            printf("Team %d Thread %d Number of threads %d \n", omp_get_team_num(), omp_get_thread_num(), omp_get_num_threads());
            #pragma omp parallel for
            for (int j = 0; j < N; j++)
            {
                E[i*N + j] = 0.0;
                for (int k = 0; k < N; k++)
                {
                    E[i*N + j] = E[i*N + j] + A[i*N + k] * B[j*N+k];
                }
            }
        }
    }
}

int main(){
  double t_start, t_end;

    float* A;
    float* B;
    float* E;

    A = (float*)malloc(N*N*sizeof(float));
    B = (float*)malloc(N*N*sizeof(float));
    E = (float*)malloc(N*N*sizeof(float));
    init_array(A, B); //initialize Matrix A and B

    t_start = omp_get_wtime();
    mm_kernel(A,B,E);
    t_end = omp_get_wtime();

    printf("Time spent %lf\n",t_end-t_start );
    free(A);
    free(B);
    free(E);
}

The program was compiled using

clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda 3mm.c -o 3mmgpu 

My main reason for claiming that the target region executes on both the host and the target device is the output on the command line.

[Screenshot: command line output]

At first, team 0 and team 1 each report 960 threads; later iterations report 2 threads per team. (My processor has 4 cores, each capable of running 2 hardware threads.)

I also tried running the fat binary under nvprof to verify whether anything is being executed on the GPU.

The profiling results are as follows.

[Screenshot: profiling result]

I cannot understand what is happening in the target region. Why is the target region being executed on both the host and the target device?

piyumi_rameshka
  • Please don't post screen captures instead of text copy-paste. – Robert Crovella Apr 08 '18 at 19:39
  • I have not managed to get Clang-ykt to compile. I tried with Ubuntu 17.10, but it says it does not support GCC 6 or later. Then I tried VirtualBox with Ubuntu 16.04, which gives different errors. I spent too much time on this. How did you get it to build? – Z boson Apr 10 '18 at 12:48
  • @Zboson I followed these links: [link 1](https://github.com/clang-ykt/clang/wiki) and [link 2](http://on-demand.gputechconf.com/gtc/2016/presentation/s6510-jeff-larkin-targeting-gpus-openmp.pdf). In addition to the cmake build options given in link 1, you may need extra build options to specify the GCC and G++ paths, as mentioned in link 2. Also, if you don't specify TARGETS_TO_BUILD, it will build all built-in targets by default. In that case you can use -DLLVM_TARGETS_TO_BUILD to customize your targets. – piyumi_rameshka Apr 10 '18 at 18:31

1 Answer


I'm posting an answer to my own question, as I was finally able to figure out what went wrong in the code. The offloaded region crashed on the target device because I had mapped the data to the GPU incorrectly: I mapped only the pointers, without allocating memory for the arrays on the GPU. Once the GPU execution crashed, execution fell back to the host.
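For illustration, here is a minimal sketch of what a corrected mapping could look like, using OpenMP array sections so the runtime knows how many elements to allocate and copy. The function name `mm_kernel_mapped` and the `n` parameter are my own choices for this sketch, and using `map(from:)` for `E` (so the result is copied back to the host) is an assumption about the intent, since the original used `map(alloc:E)`. Like the code in the question, it indexes `B` as `B[j*n + k]`, so it computes A times the transpose of B.

```c
/* Sketch only: n x n row-major matrices, same indexing as in the question. */
void mm_kernel_mapped(const float *A, const float *B, float *E, int n)
{
    /* Array sections (ptr[start:length]) tell the runtime the extent of
       each buffer. Mapping a bare pointer, as in map(to:A), does not
       allocate the pointed-to array on the device. */
    #pragma omp target data map(to: A[0:n*n], B[0:n*n]) map(from: E[0:n*n])
    {
        #pragma omp target
        #pragma omp teams distribute
        for (int i = 0; i < n; i++)
        {
            #pragma omp parallel for
            for (int j = 0; j < n; j++)
            {
                float sum = 0.0f;   /* accumulate locally, store once */
                for (int k = 0; k < n; k++)
                {
                    sum += A[i*n + k] * B[j*n + k];
                }
                E[i*n + j] = sum;
            }
        }
    }
}
```

Calling `mm_kernel_mapped(A, B, E, N)` from `main` would replace the original `mm_kernel(A,B,E)` call. If the file is built without OpenMP support, the pragmas are ignored and the loops simply run serially on the host, which is convenient for checking the result.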

Thank you @Alexey Bataev for pointing that out.

piyumi_rameshka
  • Please note that this behavior is going to change. What we are going to do is stop the application if the offloading succeeded but the execution then failed. The host code must be executed only(!) if the offloading attempt was not successful. – Alexey Bataev Apr 13 '18 at 13:28
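
Since a target region can fall back to the host, as discussed above, it can help to check from inside the region where it actually ran. The sketch below uses `omp_is_initial_device()` from the OpenMP 4.0 runtime API; the helper name `target_ran_on_host` is hypothetical, and the `_OPENMP` guards are only there so the snippet still compiles when OpenMP is disabled.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1 if the target region executed on the host, 0 if it ran on a device. */
int target_ran_on_host(void)
{
    int on_host = 1;   /* default when OpenMP is disabled */

    /* tofrom: the flag is copied to the device and back again. */
    #pragma omp target map(tofrom: on_host)
    {
#ifdef _OPENMP
        on_host = omp_is_initial_device();
#endif
    }
    return on_host;
}
```

A call such as `printf("ran on host: %d\n", target_ran_on_host());` at the start of `main` would show whether offloading is working at all before timing the real kernel.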