
This is my first post. I'll try to keep it short because I value your time. This community has been incredible to me.

I am learning OpenCL and want to extract a little bit of parallelism from the algorithm below. I will only show the part that I am working on, which I've also simplified as much as I can.

1) Inputs: two 1D arrays A and B of length n, the value of n, and the initial values C[0] and D[0].

2) Outputs: two 1D arrays C and D of length n, defined by

C[i] = function1(C[i-1])
D[i] = function2(C[i-1],D[i-1])
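
For concreteness, hypothetical stand-ins for the two functions (my real ones are more complicated, which is what makes computing C and D in parallel worthwhile) could be:

float function1(float c)          { return 0.5f * c + 1.0f; }  // made-up placeholder
float function2(float c, float d) { return c * d - 0.25f; }    // made-up placeholder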

So these are recursive definitions; however, C and D for a given i can be computed in parallel. A naive thought would be to create two work items for the following kernel:

__kernel void test (__global float* A, __global float* B, __global float* C,
                    __global float* D, int n, float C0, float D0) {
    int i, j = get_global_id(0);

    if (j == 0) {
        C[0] = C0;
        for (i = 1; i < n; i++) {
            C[i] = function1(C[i-1]);
            [WAIT FOR W.I. 1 TO FINISH CALCULATING D[i]];
        }
        return;
    }
    else {
        D[0] = D0;
        for (i = 1; i < n; i++) {
            D[i] = function2(C[i-1], D[i-1]);
            [WAIT FOR W.I. 0 TO FINISH CALCULATING C[i]];
        }
        return;
    }
}

Ideally, each of the two work items (numbers 0 and 1) would do one initial comparison and then enter its respective loop, synchronizing at each iteration. Given the SIMD implementation of GPUs, I assume this will NOT work (each work item would end up waiting on all of the kernel code), but is it possible to assign this type of work to two CPU cores and have it work as expected? What would the barrier be in that case?

– Shoomla

2 Answers


This can be implemented in OpenCL, but as the other answer says, you are going to be limited to 2 threads at best.

My version of your kernel should be launched with a single work group containing two work items.

__kernel void test (__global float* A, __global float* B, __global float* C,
                    __global float* D, int n, float C0, float D0)
{
    int i;
    int gid = get_global_id(0);

    // Shared between the two work items, so they must be in the same work group.
    __local float prevC;
    __local float prevD;

    if (gid == 0) {
        C[0] = prevC = C0;
        D[0] = prevD = D0;
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    for (i = 1; i < n; i++) {
        if (gid == 0) {
            C[i] = function1(prevC);
        } else if (gid == 1) {
            D[i] = function2(prevC, prevD);
        }

        // Global fence: make C[i] and D[i] visible to the other work item
        // before they are copied back into local memory.
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (gid == 0) {
            prevC = C[i];
            prevD = D[i];
        }
        // Local fence: make sure work item 1 sees the updated prevC/prevD.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
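
To be explicit about that launch, here is a hedged host-side fragment (it assumes the queue, kernel, and kernel arguments are already set up; error checking is omitted):

#include <CL/cl.h>

// Sketch: enqueue the kernel above with one work group of two work items,
// so barrier() synchronizes exactly those two work items.
cl_int launch_test(cl_command_queue queue, cl_kernel kernel)
{
    size_t global_size = 2;  // two work items in total ...
    size_t local_size  = 2;  // ... all in a single work group
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_size, &local_size,
                                  0, NULL, NULL);
}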

This should run on any OpenCL hardware. If you don't care about saving all of the intermediate C and D values, you can return just the final prevC and prevD in a two-float buffer rather than the entire arrays (see the sketch below). That would also make it much faster, because all reads and writes of the intermediate values stay in local memory instead of global memory. The local-memory boost should also apply on any OpenCL hardware.
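
For example, a sketch of that variant (the kernel name test_final_only and the two-element out buffer are names I made up here; it still assumes one work group of two work items):

__kernel void test_final_only (int n, float C0, float D0,
                               __global float* out) // out[0] = C[n-1], out[1] = D[n-1]
{
    int gid = get_global_id(0);

    __local float prevC;
    __local float prevD;

    if (gid == 0) {
        prevC = C0;
        prevD = D0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 1; i < n; i++) {
        float next;
        if (gid == 0)
            next = function1(prevC);
        else
            next = function2(prevC, prevD);

        // make sure both work items have read the old values ...
        barrier(CLK_LOCAL_MEM_FENCE);
        if (gid == 0) prevC = next; else prevD = next;
        // ... and see the new ones before the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (gid == 0) {
        out[0] = prevC;
        out[1] = prevD;
    }
}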

So is there a point to running this on a GPU? Not for the parallelism. You are stuck with 2 threads. But if you don't need all values of C and D returned, you would probably see a significant speed up because of the much faster memory of GPUs.

All of this assumes that function1 and function2 aren't overly complex. If they are, just stick to CPUs -- and probably another multiprocessing technique such as OpenMP.
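
For what it's worth, a minimal sketch of that OpenMP route (the run_cd wrapper and its signature are made up for illustration; function1/function2 are the question's placeholders):

#include <omp.h>

// Two OpenMP threads in lock step: thread 0 produces C, thread 1 produces D,
// and the barrier finishes step i before either thread starts step i+1.
void run_cd(float *C, float *D, int n, float C0, float D0)
{
    C[0] = C0;
    D[0] = D0;
    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();
        for (int i = 1; i < n; i++) {
            if (tid == 0)
                C[i] = function1(C[i - 1]);
            else
                D[i] = function2(C[i - 1], D[i - 1]);
            #pragma omp barrier  // also flushes, so C[i] is visible next step
        }
    }
}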

– mfa

The dependency in your case is completely linear/recursive (i needs i-1), not even logarithmic like other problems (reduction, sum, sort, etc.), and therefore it does not fit well on a SIMD device.

The best you can do is a two-thread approach on the CPU: thread 1 "produces" data (the C values) for thread 2.

A very naive approach, for example:

Thread 1:
for (i = 1; i < n; i++) {
    ProcessC(i);
    atomic_inc(counter); // this call should unlock thread 2 (a "post")
}

Thread 2:
for (i = 1; i < n; i++) {
    atomic_dec(counter); // this call should block until data is ready (a "wait")
    ProcessD(i);
}

where atomic_inc and atomic_dec can be implemented with counting semaphores, for example.
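
A fleshed-out sketch of that idea with POSIX semaphores (the names c_ready, producer, consumer, and run_pipeline are made up for illustration):

#include <pthread.h>
#include <semaphore.h>

extern float function1(float c);            // the question's placeholders
extern float function2(float c, float d);

static float *C, *D;                        // assumed allocated, with C[0], D[0] set
static int n;
static sem_t c_ready;                       // counts C values published so far

static void *producer(void *arg)            // thread 1: C values
{
    for (int i = 1; i < n; i++) {
        C[i] = function1(C[i - 1]);
        sem_post(&c_ready);                 // the "unlocking" atomic_inc
    }
    return NULL;
}

static void *consumer(void *arg)            // thread 2: D values
{
    for (int i = 1; i < n; i++) {
        sem_wait(&c_ready);                 // the "locking" atomic_dec; this waits
                                            // one step more than strictly needed,
                                            // since D[i] only reads C[i-1]
        D[i] = function2(C[i - 1], D[i - 1]);
    }
    return NULL;
}

static void run_pipeline(void)
{
    pthread_t t1, t2;
    sem_init(&c_ready, 0, 0);
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&c_ready);
}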

– DarkZeros
  • Also, not all global work items even execute together (they may get done in blocks that must wholly finish before the next set is done), so waiting is not an option. – Dithermaster Feb 05 '16 at 20:19
  • OK, so my kernel would essentially stay the same except for those atomic functions you've mentioned? (Gotta say, I'm still reading about OpenCL synchronization, so I can't quite grasp what these functions do atm.) And if run on a CPU, would the code branch on the if statement as expected? – Shoomla Feb 08 '16 at 10:07