0

I'm Trying to convert a code written in Cuda to openCL and run into some trouble. My final goal is to implement the code on an Odroid XU3 board with a Mali T628 GPU.

In order to simplify the transition and save time trying to debug openCL kernels I've done the following steps:

  1. Implement the code in Cuda and test it on a Nvidia GeForce 760
  2. Implement the code in openCL and test it on a Nvidia GeForce 760
  3. test the openCL code on an Odroid XU3 board with a Mali T628 GPU.

I know that different architectures may have different optimizations but that isn't my main concern for now. I manged to run the openCL code on my Nvidia GPU with no apparent issues but keep getting strange errors when trying to run the code on the Odroid board. I know that different architectures have different handling of exceptions etc. but I'm not sure how to solve those.

Since the openCL code works on my Nvidia I assume that I managed to do the correct transition between thread/blocks -> workItems/workGroups etc. I already fixed several issues that relate to the cl_device_max_work_group_size issue so that can't be the cuase.

When running the code i'm getting a "CL_OUT_OF_RESOURCES" error. I've narrowed the cause of the error to 2 lines in the code but not sure to fix those issues.

the error is caused by the following lines:

  1. lowestDist[pixelNum] = partialDiffSumTemp; both variables are private variables of the kernel and therefor I don't see any potential issue.
  2. d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 0] = bestDisparity[0]; Here I guess the cause is "OUT_OF_BOUND" but not sure how to debug it since the original code doesn't have any issue.

My Kernel code is is:

#define ALIGN_IMAGE_WIDTH          64
#define NUM_PIXEL_PER_THREAD        4

#define MIN_DISPARITY               0  
#define MAX_DISPARITY              55  

#define WINDOW_SIZE                19 
#define WINDOW_RADIUS              (WINDOW_SIZE / 2)   

#define TILE_SHARED_MEM_WIDTH      96                       
#define TILE_SHARED_MEM_HEIGHT     32
#define TILE_BOUNDARY_WIDTH        64
#define TILE_BOUNDARY_HEIGHT       (2 * WINDOW_RADIUS)

#define BLOCK_WIDTH                (TILE_SHARED_MEM_WIDTH  - TILE_BOUNDARY_WIDTH) 
#define BLOCK_HEIGHT               (TILE_SHARED_MEM_HEIGHT - TILE_BOUNDARY_HEIGHT)  

#define THREAD_NUM_WIDTH            8
#define THREADS_NUM_HEIGHT         TILE_SHARED_MEM_HEIGHT

 //TODO fix input arguments
__kernel void hello_kernel( __global unsigned char*  d_leftImage,
                            __global unsigned char*  d_rightImage,
                            __global float* d_disparityLeft) {

    int blockX      = get_group_id(0);
    int blockY      = get_group_id(1);
    int threadX     = get_local_id(0);
    int threadY     = get_local_id(1);

    __local unsigned char leftImage      [TILE_SHARED_MEM_WIDTH * TILE_SHARED_MEM_HEIGHT];
    __local unsigned char rightImage     [TILE_SHARED_MEM_WIDTH * TILE_SHARED_MEM_HEIGHT];
    __local unsigned int  partialDiffSum [BLOCK_WIDTH           * TILE_SHARED_MEM_HEIGHT];

    int alignedImageWidth = 640;
    int partialDiffSumTemp;
    float bestDisparity[4] = {0,0,0,0};
    int lowestDist[4];
        lowestDist[0] = 214748364;
        lowestDist[1] = 214748364;
        lowestDist[2] = 214748364;
        lowestDist[3] = 214748364;

    // Read image blocks into shared memory. read is done at 32bit integers on a uchar array. each thread reads 3 integers(12byte) 96/12=8threads
    int sharedMemIdx = threadY * TILE_SHARED_MEM_WIDTH + 4 * threadX; 
    int globalMemIdx = (blockY * BLOCK_HEIGHT + threadY) * alignedImageWidth + blockX * BLOCK_WIDTH + 4 * threadX; 

    for (int i = 0; i < 4; i++) {
        leftImage [sharedMemIdx                        + i ] = d_leftImage [globalMemIdx                        + i];
        leftImage [sharedMemIdx + 4 * THREAD_NUM_WIDTH + i ] = d_leftImage [globalMemIdx + 4 * THREAD_NUM_WIDTH + i];
        leftImage [sharedMemIdx + 8 * THREAD_NUM_WIDTH + i ] = d_leftImage [globalMemIdx + 8 * THREAD_NUM_WIDTH + i];
        rightImage[sharedMemIdx                        + i ] = d_rightImage[globalMemIdx                        + i];
        rightImage[sharedMemIdx + 4 * THREAD_NUM_WIDTH + i ] = d_rightImage[globalMemIdx + 4 * THREAD_NUM_WIDTH + i];
        rightImage[sharedMemIdx + 8 * THREAD_NUM_WIDTH + i ] = d_rightImage[globalMemIdx + 8 * THREAD_NUM_WIDTH + i];
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    int imageIdx = sharedMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS;
    int partialSumIdx = threadY * BLOCK_WIDTH + 4 * threadX;

    for(int dispLevel = MIN_DISPARITY; dispLevel <= MAX_DISPARITY; dispLevel++) {

        // horizontal partial sum
        partialDiffSumTemp = 0;
        #pragma unroll
        for(int i = imageIdx - WINDOW_RADIUS; i <= imageIdx + WINDOW_RADIUS; i++) {
                    //partialDiffSumTemp += calcDiff(leftImage [i], rightImage[i - dispLevel]);
                      partialDiffSumTemp += abs(leftImage[i] - rightImage[i - dispLevel]);
        }
        partialDiffSum[partialSumIdx] = partialDiffSumTemp;

        barrier(CLK_LOCAL_MEM_FENCE);

        for (int pixelNum = 1, i = imageIdx - WINDOW_RADIUS; pixelNum < NUM_PIXEL_PER_THREAD; pixelNum++, i++) {
            partialDiffSum[partialSumIdx + pixelNum] = partialDiffSum[partialSumIdx + pixelNum - 1] + 
                                                       abs(leftImage[i + WINDOW_SIZE] - rightImage[i - dispLevel + WINDOW_SIZE]) -
                                                       abs(leftImage[i]               - rightImage[i - dispLevel]);
        }

        barrier(CLK_LOCAL_MEM_FENCE);

        // vertical sum
        if(threadY >= WINDOW_RADIUS && threadY < TILE_SHARED_MEM_HEIGHT - WINDOW_RADIUS) {

            for (int pixelNum = 0; pixelNum < NUM_PIXEL_PER_THREAD; pixelNum++) {
                int rowIdx = partialSumIdx - WINDOW_RADIUS * BLOCK_WIDTH;
                partialDiffSumTemp = 0;

                    for(int i = -WINDOW_RADIUS; i <= WINDOW_RADIUS; i++,rowIdx += BLOCK_WIDTH) {
                           partialDiffSumTemp += partialDiffSum[rowIdx + pixelNum];
                    }

                    if (partialDiffSumTemp < lowestDist[pixelNum]) {
                        lowestDist[pixelNum]    = partialDiffSumTemp;
                        bestDisparity[pixelNum] = dispLevel - 1;
                    }


            }
        }

    }

    if (threadY >= WINDOW_RADIUS && threadY < TILE_SHARED_MEM_HEIGHT - WINDOW_RADIUS && blockY < 32) {

        d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 0] = bestDisparity[0];
        d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 1] = bestDisparity[1];
        d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 2] = bestDisparity[2];
        d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 3] = bestDisparity[3];
    }

}

Thanks for all the help

Yuval

talonmies
  • 70,661
  • 34
  • 192
  • 269
Yuval
  • 1
  • 5
  • GPU code is hard to debug, and particularly when it comes to unusual hardware. It's hard to imagine how an "answer" to this "question" could look like: One can only try to *guess* what *might* be wrong. However, it is right that an out-of-bounds access may cause a `CL_OUT_OF_RESOURCES` error. So an alternative to `printf` debugging: You may also run your program with `cuda-memcheck YourProgram.exe`: It will print whether there are invalid memory accesses (it might even be possible to obtain the line number information, but I'm not sure about that) – Marco13 Jun 07 '15 at 23:32
  • I know this is old, but I had a similar issue. I'm launching multiple kernels and I kept getting the "out of resources" error. Most of the kernels run now without errors after I reduced my usage of private variables in the kernels, so it could be running out of registers...? It is a very strange issue and I'm yet to fix this last kernel. Another thing to note is that Mali GPUs report their shared memory types as "global", so there may not be any performance gain from it, and I get these errors when accessing local memory. So one possible solution is to eliminate shared memory usage. – Val9265 Feb 05 '18 at 14:13
  • The user posted this question on the ARM community forum and it seems the problem was in the local work-size. Addressing this issue fixed my problem as well. It's strange through, since I would expect an error complaining about the worksize being invalid (as it did several times before) since I was using a 8*32 local work-size. – Val9265 Feb 05 '18 at 14:21

1 Answers1

0

From my experience NVidia GPUs not always crash on out of bound access and many times kernel still returns expected results.

Use printf to check the indexes. If you have Nvidia OpenCL 1.2 driver installed printf should be available as a core function. As far as I checked Mali-T628 uses OpenCL 1.1 then check if printf is available as a vendor extension. Also you can run your kernel on AMD/Intel CPU where printf is available (OpenCL 1.2 / 2.0).

Alternative way of checking indexes can be passing __global int* debug array where you would store indexes and then check them on the host. Make sure to allocate it big enough so that out of bound index will be recorded.

doqtor
  • 8,414
  • 2
  • 20
  • 36