OpenCL Shared Memory Among Tasks

Question

I've been working on a GPU-based Conway's Game of Life program. If you're not familiar with it, see its Wikipedia page. I created one version that works by keeping an array of values where 0 represents a dead cell and 1 a live one. The kernel simply writes to an image buffer data array to draw an image based on the cell data, and then checks each cell's neighbors to update the cell array for the next execution to render.

However, a faster method instead represents a cell as a negative number if it is dead and a positive number if it is alive. The magnitude of the number is the cell's number of neighbors plus one (which makes zero an impossible value, since we cannot distinguish 0 from -0). However, this means that when spawning or killing a cell we must update its eight neighbors' values accordingly. Thus, unlike the working procedure, which only has to read from the neighboring memory slots, this procedure must write to those slots. Doing so gives inconsistent results, and the output array is not valid. For example, cells contain numbers such as 14, which would indicate 13 neighbors, an impossible value.

The logic is correct, as I wrote the same procedure on the CPU and it works as expected. After testing, I believe that when tasks try to write to the same memory at the same time, there is a delay that leads to a write error of some kind. For example, perhaps there is a delay between reading the array data and writing it, during which the data is changed, making another task's computation incorrect. I've tried using semaphores and barriers, but I have only just learned OpenCL and parallel processing and don't quite grasp them yet. The kernel is as follows.

int wrap(int val, int limit){
    int response = val;
    if(response<0){response+=limit;}
    if(response>=limit){response-=limit;}
    return response;
}

__kernel void optimizedModel(
        __global uint *output,
        int sizeX, int sizeY,
        __global uint *colorMap,
        __global uint *newCellMap,
        __global uint *historyBuffer
)
{
    // the x and y coordinates that are currently being computed
    unsigned int x = get_global_id(0);
    unsigned int y = get_global_id(1);

    int cellValue = historyBuffer[sizeX*y+x];
    int neighborCount = abs(cellValue)-1;
    output[y*sizeX+x] = colorMap[cellValue > 0 ? 1 : 0];

    if(cellValue > 0){// if alive
        if(neighborCount < 2 || neighborCount > 3){
            // kill

            for(int i=-1; i<2; i++){
                for(int j=-1; j<2; j++){
                    if(i!=0 || j!=0){
                        int wxc = wrap(x+i, sizeX);
                        int wyc = wrap(y+j, sizeY);
                        newCellMap[sizeX*wyc+wxc] -= newCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
                    }
                }
            }
            newCellMap[sizeX*y+x] *= -1;

            // end kill
        }
    }else{
        if(neighborCount==3){
            // spawn

            for(int i=-1; i<2; i++){
                for(int j=-1; j<2; j++){
                    if(i!=0 || j!=0){
                        int wxc = wrap(x+i, sizeX);
                        int wyc = wrap(y+j, sizeY);
                        newCellMap[sizeX*wyc+wxc] += newCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
                    }
                }
            }
            newCellMap[sizeX*y+x] *= -1;

            // end spawn
        }
    }
}

The output array is the image buffer data used to render the kernel's computation.
The sizeX and sizeY constants are the width and height of the image buffer, respectively.
The colorMap array contains the RGB integer values for black and white, respectively, which are used to set the image buffer's values so the colors render properly.
The newCellMap array is the updated cell map, calculated once rendering is determined.
The historyBuffer is the old state of the cells at the beginning of the kernel call. Every time the kernel is executed, this array is updated to match the newCellMap array.

Additionally, the wrap function makes the space toroidal. How could I fix this code so that it works as expected? And why doesn't the global memory update with each change made by a task? Isn't it supposed to be shared memory?

The answer is quite simple. Reads and writes to the same memory location from different threads in a single kernel call are undefined. The only way to make them work is barriers, and even those work only within a single workgroup. — sharpneli, Nov 18 '13 at 08:30
So then wouldn't this fail at cells bordering a work group? Similarly, how would a barrier in an if act? Because not all tasks would encounter it. — Hunter Larco, Nov 18 '13 at 11:45
@HunterLarco Yes, indeed it will fail in cells bordering a work group, but only if you set proper barriers. As you don't, it fails in any cell. You cannot have a barrier inside an if, since a barrier **HAS** to be encountered by all the work-items. — DarkZeros, Nov 18 '13 at 12:33

score 1 · Accepted Answer · answered Nov 18 '13 at 09:56

As sharpneli said, you are reading and writing the same memory locations from different threads, and that gives undefined behaviour.
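To make that concrete, here is a minimal plain-C sketch (not OpenCL, and not the asker's code) that replays one unlucky interleaving of two work-items each doing `cell += 1` on the same slot. The name `lost_update_demo` is hypothetical, and real hardware may interleave the steps differently or not at all; the point is only that a read-modify-write is several steps, not one.

```c
#include <assert.h>

/* Replays one possible interleaving of two work-items that each execute
 * `cell += 1` on the same global slot without any synchronization.
 * Returns the value left in the slot afterwards. */
int lost_update_demo(int initial)
{
    int cell = initial;

    int a = cell;      /* work-item A reads the slot                */
    int b = cell;      /* work-item B reads it before A has written */
    assert(a == b);    /* both work from the same stale value       */

    cell = a + 1;      /* A writes back its result                  */
    cell = b + 1;      /* B overwrites A's write                    */

    return cell;       /* only one of the two increments survives   */
}
```

Starting from 3, two increments should give 5, but this interleaving leaves 4 in the slot. With sign-dependent updates like the ones in the question, the same effect can also push a count in the wrong direction.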

Solution: split newCellMap into two arrays, one holding the values from the previous execution and one where the new values will be stored. Then swap the kernel arguments on the host side on each call, so that the old values of the next iteration are the new values of the previous iteration. Because of how your algorithm is structured, you will also need to copy the old values into the new-values buffer (a buffer copy, for example with clEnqueueCopyBuffer()) before you run the kernel.

__kernel void optimizedModel(
        __global uint *output,
        int sizeX, int sizeY,
        __global uint *colorMap,
        __global uint *oldCellMap,
        __global uint *newCellMap,
        __global uint *historyBuffer
)
{
    // the x and y coordinates that are currently being computed
    unsigned int x = get_global_id(0);
    unsigned int y = get_global_id(1);

    int cellValue = historyBuffer[sizeX*y+x];
    int neighborCount = abs(cellValue)-1;
    output[y*sizeX+x] = colorMap[cellValue > 0 ? 1 : 0];

    if(cellValue > 0){// if alive
        if(neighborCount < 2 || neighborCount > 3){
            // kill

            for(int i=-1; i<2; i++){
                for(int j=-1; j<2; j++){
                    if(i!=0 || j!=0){
                        int wxc = wrap(x+i, sizeX);
                        int wyc = wrap(y+j, sizeY);
                        newCellMap[sizeX*wyc+wxc] -= oldCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
                    }
                }
            }
            newCellMap[sizeX*y+x] *= -1;

            // end kill
        }
    }else{
        if(neighborCount==3){
            // spawn

            for(int i=-1; i<2; i++){
                for(int j=-1; j<2; j++){
                    if(i!=0 || j!=0){
                        int wxc = wrap(x+i, sizeX);
                        int wyc = wrap(y+j, sizeY);
                        newCellMap[sizeX*wyc+wxc] += oldCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
                    }
                }
            }
            newCellMap[sizeX*y+x] *= -1;

            // end spawn
        }
    }
}
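A related way to avoid concurrent writes entirely is a gather: each cell reads only the old buffer and writes only its own slot in the new one. Note that this is a different technique from the scatter update in the kernel above, where two work-items can still target the same neighbor slot of newCellMap. The sketch below is plain C standing in for a kernel (the two outer loops play the role of work-items), it keeps the question's encoding (sign = alive, magnitude = live neighbors + 1), and the names `life_step_gather` and `next_alive` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Toroidal wrap for an index that is off the grid by at most one cell. */
static int wrap(int val, int limit)
{
    if (val < 0) val += limit;
    if (val >= limit) val -= limit;
    return val;
}

/* Next alive (1) / dead (0) state of a cell, derived only from its old
 * encoded value: sign = alive, magnitude = live neighbors + 1. */
static int next_alive(int encoded)
{
    int neighbors = abs(encoded) - 1;
    return encoded > 0 ? (neighbors == 2 || neighbors == 3)
                       : (neighbors == 3);
}

/* One generation. Every cell reads old_map only and writes exactly one
 * slot of new_map, so no two iterations ever write the same location. */
void life_step_gather(const int *old_map, int *new_map, int size_x, int size_y)
{
    assert(size_x > 0 && size_y > 0 && old_map != new_map);

    for (int y = 0; y < size_y; y++) {
        for (int x = 0; x < size_x; x++) {
            /* Count the neighbors that will be alive in the next generation. */
            int count = 0;
            for (int j = -1; j <= 1; j++) {
                for (int i = -1; i <= 1; i++) {
                    if (i != 0 || j != 0) {
                        int wx = wrap(x + i, size_x);
                        int wy = wrap(y + j, size_y);
                        count += next_alive(old_map[size_x * wy + wx]);
                    }
                }
            }
            int alive = next_alive(old_map[size_x * y + x]);
            new_map[size_x * y + x] = alive ? count + 1 : -(count + 1);
        }
    }
}
```

With this shape no copy between iterations is needed: on the host, the two buffers would simply be passed to clSetKernelArg() in the opposite order before each launch. The trade-off is that every cell does the full 8-neighbor read each generation, which gives up the "only touch neighbors when something changes" saving the encoding was meant to provide.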

Your question about shared memory has a simple answer: OpenCL does not have shared memory across host and device.

When you create a memory buffer for the device, you first have to initialize that memory zone with clEnqueueWriteBuffer(), and then read it back with clEnqueueReadBuffer() to get the results. Even if you do have a pointer to the memory zone, that pointer refers to the host-side copy of the zone, which is unlikely to hold the latest output computed by the device.

PS: A long time ago I wrote a Game of Life in OpenCL, and I found that the easiest and fastest way to do it is simply to create a big 2D array of bits (bit addressing), and then write a piece of code without any branches that analyzes the neighbors and computes the updated value for each cell. Since bit addressing is used, the amount of memory read and written by each thread is considerably lower than when addressing chars, ints, or other types. I achieved 33 Mcells/sec on very old OpenCL hardware (NVIDIA 9100M G). Just to let you know that your if/else approach is probably not the most efficient one.
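As an illustration of what "without any branches" can look like for the rule itself, here is a small plain-C sketch. It is not the code the answer refers to; `life_rule` is a hypothetical name, and the bitwise trick is just one common way to fold the two cases of the rule into a single comparison.

```c
#include <assert.h>

/* Branch-free Life rule: returns 1 if the cell is alive in the next
 * generation, 0 otherwise. `alive` must be 0 or 1, `neighbors` 0..8. */
unsigned life_rule(unsigned alive, unsigned neighbors)
{
    assert(alive <= 1u && neighbors <= 8u);

    /* (neighbors | alive) == 3 holds exactly when neighbors == 3 (for any
     * state), or when neighbors == 2 and the cell is currently alive. */
    return (unsigned)((neighbors | alive) == 3u);
}
```

The same expression works unchanged on values unpacked from a bit array, since it only needs the current state as 0 or 1 and the neighbor count.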

OpenCL 2.0 and above supports shared memory between host and device by means of [Shared Virtual Memory](https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/sharedVirtualMemory.html), but it wouldn't have been easy to know this at the time this answer was posted; the Khronos Group announced [the ratification of OpenCL 2.0](https://www.khronos.org/news/press/khronos-finalizes-opencl-2.0-specification-for-heterogeneous-computing) that same day. :) — cqcallaw, Oct 28 '18 at 03:25

score 1 · Answer 2 · answered Nov 18 '13 at 20:29

Just for reference, here is my implementation of the Game of Life (OpenCL kernel):

//Each work-item processes one 4x2 block of cells, but needs access to the 3x3 grid of 4x2 blocks surrounding it
//    . . . . . .
//    . * * * * .
//    . * * * * .
//    . . . . . .

__kernel void life (__global unsigned char * input, __global unsigned char * output){

    int x_length = get_global_size(0);
    int x_id = get_global_id(0);
    int y_length = get_global_size(1);
    int y_id = get_global_id(1);
    //int lx_length = get_local_size(0);
    //int ly_length = get_local_size(1);

    int x_n = (x_length+x_id-1)%x_length; //Negative X
    int x_p = (x_length+x_id+1)%x_length; //Positive X
    int y_n = (y_length+y_id-1)%y_length; //Negative Y
    int y_p = (y_length+y_id+1)%y_length; //Positive Y

    //Get the data of the surrounding blocks (TODO: Make this shared across the local group)
    unsigned char block[3][3];
    block[0][0] = input[x_n + y_n*x_length];
    block[1][0] = input[x_id + y_n*x_length];
    block[2][0] = input[x_p + y_n*x_length];
    block[0][1] = input[x_n + y_id*x_length];
    block[1][1] = input[x_id + y_id*x_length];
    block[2][1] = input[x_p + y_id*x_length];
    block[0][2] = input[x_n + y_p*x_length];
    block[1][2] = input[x_id + y_p*x_length];
    block[2][2] = input[x_p + y_p*x_length];

    //Expand the block to points (bool array)
    bool point[6][4];
    point[0][0] = (bool)(block[0][0] & 1);
    point[1][0] = (bool)(block[1][0] & 8);
    point[2][0] = (bool)(block[1][0] & 4);
    point[3][0] = (bool)(block[1][0] & 2);
    point[4][0] = (bool)(block[1][0] & 1);
    point[5][0] = (bool)(block[2][0] & 8);
    point[0][1] = (bool)(block[0][1] & 16);
    point[1][1] = (bool)(block[1][1] & 128);
    point[2][1] = (bool)(block[1][1] & 64);
    point[3][1] = (bool)(block[1][1] & 32);
    point[4][1] = (bool)(block[1][1] & 16);
    point[5][1] = (bool)(block[2][1] & 128);
    point[0][2] = (bool)(block[0][1] & 1);
    point[1][2] = (bool)(block[1][1] & 8);
    point[2][2] = (bool)(block[1][1] & 4);
    point[3][2] = (bool)(block[1][1] & 2);
    point[4][2] = (bool)(block[1][1] & 1);
    point[5][2] = (bool)(block[2][1] & 8);
    point[0][3] = (bool)(block[0][2] & 16);
    point[1][3] = (bool)(block[1][2] & 128);
    point[2][3] = (bool)(block[1][2] & 64);
    point[3][3] = (bool)(block[1][2] & 32);
    point[4][3] = (bool)(block[1][2] & 16);
    point[5][3] = (bool)(block[2][2] & 128);

    //Process one point of the game of life!
    unsigned char out = (unsigned char)0;
    for(int j=0; j<2; j++){
        for(int i=0; i<4; i++){
            char num = point[i][j] + point[i+1][j] + point[i+2][j] + point[i][j+1] + point[i+2][j+1] + point[i][j+2] + point[i+1][j+2] + point[i+2][j+2];
            if(num == 3 || (num == 2 && point[i+1][j+1])){
                out |= (128>>(i+4*j));
            }
        }
    }
    output[x_id + y_id*x_length] = out; //Assign to the output the new cells value
}

Here you don't save any intermediate state, just the status of each cell at the end (alive/dead). It does not branch on cell state apart from the final rule check, so it is quite fast.
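For readers following the bit layout, the packing used by the kernel can be read off its masks: each byte holds a 4x2 block, row 0 in the high nibble, with the leftmost column in the highest bit. A minimal plain-C sketch of that layout follows; the helper names `cell_mask`, `get_cell` and `set_cell` are hypothetical and do not appear in the kernel.

```c
#include <assert.h>

/* Mask of cell (i, j) inside one packed byte: i = column 0..3, j = row 0..1.
 * Matches the kernel's `128 >> (i + 4*j)` when it builds the output byte. */
unsigned char cell_mask(int i, int j)
{
    assert(i >= 0 && i < 4 && j >= 0 && j < 2);
    return (unsigned char)(128u >> (i + 4 * j));
}

/* Returns 1 if cell (i, j) of the packed block is alive, 0 otherwise. */
int get_cell(unsigned char block, int i, int j)
{
    return (block & cell_mask(i, j)) != 0;
}

/* Returns a copy of the block with cell (i, j) set alive (1) or dead (0). */
unsigned char set_cell(unsigned char block, int i, int j, int alive)
{
    unsigned char m = cell_mask(i, j);
    return alive ? (unsigned char)(block | m)
                 : (unsigned char)(block & ~m);
}
```

For example, the byte 0xA5 (binary 1010 0101) encodes a top row of alive, dead, alive, dead and a bottom row of dead, alive, dead, alive.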
