I have a problem that I don't know how to solve.
I work with two clusters: one with 6 Tesla C1060 GPUs and another with 2 Tesla K20M GPUs.
I have two OpenCL programs that use JOCL as the Java bindings. The first one has this structure:
1 OpenCL kernel
...code...
clEnqueueNDRangeKernel(commandQueues[i], kernel[i], 1, null,
global_work_size, local_work_size, 0, null, events[i]);
clFlush(commandQueues[i]);
This one works on both clusters, with the Tesla C1060 and with the Tesla K20M.
The second program has this structure:
4 OpenCL kernels
...code...
clEnqueueNDRangeKernel(commandQueues[i], kernel1[i], 1, null,
global_work_size, local_work_size, 0, null, events[i]);
clEnqueueNDRangeKernel(commandQueues[i], kernel2[i], 1, null,
global_work_size, local_work_size, 0, null, events[i]);
clEnqueueNDRangeKernel(commandQueues[i], kernel3[i], 1, null,
global_work_size, local_work_size, 0, null, events[i]);
...code...
read the result from the 3rd kernel and do a small data comparison
...code...
clEnqueueNDRangeKernel(commandQueues[i], kernel4[i], 1, null,
global_work_size, local_work_size, 0, null, events[i]);
clFlush(commandQueues[i]);
I get the expected result, but only on the cluster with the 2 Tesla K20M. On the other cluster, with the 6 Tesla C1060, I get a wrong result (the program starts and ends normally, but delivers a wrong result). I've tried it with only 1, 2, 3, 4, and 5 Tesla C1060, and every time I get the wrong result.
I need help finding out whether a hardware problem causes this, or whether I have to change how the multiple kernel executions are started. Maybe I have to read the result back after every kernel execution and only then pass it to the next kernel?
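To narrow this down, one thing I could do is read each kernel's output back to the host with a blocking clEnqueueReadBuffer and compare it against a reference (for example, the values produced on the K20M cluster). Below is only a rough sketch of such a host-side check. KernelResultCheck and firstMismatch are names I made up for illustration and are not part of JOCL, and I am assuming here that the buffers hold float values.

```java
// Hypothetical helper, not part of JOCL: compares a kernel result that was
// read back to the host against a reference array (assumed to be float data).
final class KernelResultCheck {

    // Returns the index of the first element that differs from the reference
    // by more than tol (a NaN on either side counts as a mismatch),
    // or -1 if all elements match.
    static int firstMismatch(float[] reference, float[] actual, float tol) {
        if (reference.length != actual.length) {
            throw new IllegalArgumentException("length mismatch: "
                    + reference.length + " vs " + actual.length);
        }
        for (int i = 0; i < reference.length; i++) {
            // Exactly equal values (including equal infinities) always match.
            if (reference[i] == actual[i]) {
                continue;
            }
            // Written as !(<=) so that a NaN difference is reported as well.
            if (!(Math.abs(reference[i] - actual[i]) <= tol)) {
                return i;
            }
        }
        return -1;
    }
}
```

Running a check like this on the outputs of kernel1, kernel2, and kernel3 in turn should at least show which kernel is the first to diverge on the C1060.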
I'd appreciate any help.
Thank you