This is my first post. I'll try to keep it short because I value your time. This community has been incredible to me.
I am learning OpenCL and want to extract a little bit of parallelism from the below algorithm. I will only show you the part that I am working on, which I've also simplified as much as I can.
1) Inputs: Two 1D arrays of length (n): A, B, and value of n. Also values C[0], D[0].
2) Outputs: Two 1D arrays of length (n): C, D.
C[i] = function1(C[i-1])
D[i] = function2(C[i-1],D[i-1])
So these are recursive definitions, however the calculation of C & D for a given i value can be done in parallel (they are obviously more complicated, so as to make sense). A naive thought would be creating two work items for the following kernel:
__kernel void test (__global float* A, __global float* B, __global float* C,
__global float* D, int n, float C0, float D0) {
int i, j=get_global_id(0);
if (j==0) {
C[0] = C0;
for (i=1;i<=n-1;i++) {
C[i] = function1(C[i-1]);
[WAIT FOR W.I. 1 TO FINISH CALCULATING D[i]];
}
return;
}
else {
D[0] = D0;
for (i=1;i<=n-1;i++) {
D[i] = function2(C[i-1],D[i-1]);
[WAIT FOR W.I. 0 TO FINISH CALCULATING C[i]];
}
return;
}
}
Ideally each of the two work items (numbers 0,1) would do one initial comparison and then enter their respective loop, synchronizing for each iteration. Now given the SIMD implementation of GPUs, I assume that this will NOT work (work items would be waiting for all of the kernel code), however is it possible to assign this type of work to two CPU cores and have it work as expected? What will the barrier be in this case?