
Clarification: I used to hardcode the parameters of my OpenCL host code (kernel NDRanges, devices, explicit copies) into each project, but that got tiring, so I decided to write a fully dynamic class that does divide and conquer on arrays and sends the pieces to different devices. For now, it can build a list of OpenCL-capable devices and can divide the work using one thread pool, compute with a second pool, and re-assemble the results with a third pool (example: 1 thread dividing, 2 threads computing on 2 devices, 1 thread assembling results back into the original array). The threads live in thread pools created with Java's Executors.
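For readers unfamiliar with the setup, the divide/compute/assemble pipeline described above can be sketched with three dedicated executors. This is a minimal, hypothetical sketch, not the actual `AccContext` implementation; `Math.sin` stands in for the real OpenCL kernel, and `Arrays.copyOfRange` stands in for the host-to-device copy:

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch (names hypothetical) of a divide / compute / assemble
// pipeline built on three dedicated executors. Math.sin stands in for
// the real OpenCL kernel.
public class PipelineSketch {

    static double[] runPipeline(double[] input, int pieces) throws Exception {
        final double[] output = new double[input.length];
        final int chunk = input.length / pieces; // assume it divides evenly

        ExecutorService divider   = Executors.newSingleThreadExecutor();
        ExecutorService computers = Executors.newFixedThreadPool(2); // one per device
        ExecutorService assembler = Executors.newSingleThreadExecutor();

        List<Future<?>> pending = new ArrayList<>();
        for (int p = 0; p < pieces; p++) {
            final int start = p * chunk;
            // "divide": slice out one piece (stands in for a host->device copy)
            final Future<double[]> piece = divider.submit(
                () -> Arrays.copyOfRange(input, start, start + chunk));
            // "compute": apply the stand-in kernel to the piece on a worker thread
            final Future<double[]> result = computers.submit(() -> {
                double[] in = piece.get();
                double[] out = new double[in.length];
                for (int i = 0; i < in.length; i++) out[i] = Math.sin(in[i]);
                return out;
            });
            // "assemble": write the piece's result back into the big array
            pending.add(assembler.submit(() -> {
                System.arraycopy(result.get(), 0, output, start, chunk);
                return null;
            }));
        }
        for (Future<?> f : pending) f.get();
        divider.shutdown(); computers.shutdown(); assembler.shutdown();
        return output;
    }

    public static void main(String[] args) throws Exception {
        double[] input = new double[16384];
        for (int i = 0; i < input.length; i++) input[i] = 0.1 * i;
        double[] out = runPipeline(input, 8);
        System.out.println(out[1] == Math.sin(0.1) ? "ok" : "mismatch");
    }
}
```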

Problem: When it divides a 2M-element workload into 256 pieces (8k elements each), the kernel launch overheads (on the GPU) exceed 100% of the execution time. When I increase each piece's size (to ~64k or 128k), the scheduling stops being performance-aware: a weak device may get too much work and slow everything down even more. There is also the possibility of unequal workloads per element (ray tracing, tree construction for particle collisions, variable loop iteration counts, ...). My GPU can execute only two independent kernels at the same time, so I cannot successfully hide the read-compute-write of the divide-and-conquer pieces. If I simply divide the whole work into two pieces, it gets even worse. When the number of elements reaches 20M, the kernel launch overhead will be far more visible.
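One way to reason about the piece-size trade-off above: given a measured per-enqueue launch cost and a measured device throughput, there is a minimum piece size below which the fixed launch overhead dominates. A small sketch of that arithmetic (the 0.5 ms and throughput figures below are illustrative placeholders, not measurements from the question's code):

```java
// Sketch: choose a minimum piece size so that the fixed kernel-launch
// overhead stays below a target fraction of each piece's runtime.
// All numbers below are illustrative placeholders.
public class ChunkSizer {
    // launchOverheadSec: measured per-enqueue cost (e.g. ~0.0005 s)
    // elementsPerSec:    measured device throughput for this kernel
    // maxOverheadShare:  e.g. 0.05 => overhead <= 5% of piece runtime
    static int minPieceSize(double launchOverheadSec,
                            double elementsPerSec,
                            double maxOverheadShare) {
        // overhead / (overhead + n / throughput) <= share
        // => n >= throughput * overhead * (1 - share) / share
        double n = elementsPerSec * launchOverheadSec
                 * (1.0 - maxOverheadShare) / maxOverheadShare;
        return (int) Math.ceil(n);
    }

    public static void main(String[] args) {
        // e.g. 0.5 ms launch, 100M elements/s, 5% overhead budget
        int n = minPieceSize(0.0005, 100_000_000.0, 0.05);
        System.out.println(n); // prints 950000 (elements per piece)
    }
}
```

With these placeholder numbers, 8k-element pieces are two orders of magnitude too small, which is consistent with the ">100% overhead" observation.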

Question: Should I copy the whole array to all devices only once (instead of in pieces; 10 copies for 10 devices, some of which can be hidden with overlapped computes and copies) and compute in a single contiguous NDRangeKernel execution, or should I copy a minimal-sized part to a device, test its performance, and then copy a variable-length slice of the array based on that micro-benchmark? How can I balance the number of kernel executions against performance awareness, to achieve real-time speed for ray tracing, particle collision, and sorting algorithms, so that my CPU actually helps the GPU instead of decreasing performance?
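The second option (micro-benchmark, then split proportionally) is commonly sketched like this: give each device a share of the array proportional to its measured throughput. A hypothetical illustration, with made-up throughput numbers standing in for the warm-up benchmark results:

```java
import java.util.Arrays;

// Sketch of "micro-benchmark then split proportionally": each device
// gets a share of the array proportional to its measured throughput.
// Throughput numbers in main() are illustrative placeholders.
public class ProportionalSplit {
    // Returns how many elements each device should get; shares sum to
    // total, with the integer-rounding remainder handed to the fastest device.
    static int[] split(double[] elementsPerSec, int total) {
        double sum = 0;
        for (double t : elementsPerSec) sum += t;
        int[] shares = new int[elementsPerSec.length];
        int assigned = 0, fastest = 0;
        for (int i = 0; i < shares.length; i++) {
            shares[i] = (int) (total * elementsPerSec[i] / sum);
            assigned += shares[i];
            if (elementsPerSec[i] > elementsPerSec[fastest]) fastest = i;
        }
        shares[fastest] += total - assigned; // remainder to the fastest
        return shares;
    }

    public static void main(String[] args) {
        // Suppose the GPU measured 4x faster than the CPU in the warm-up run
        int[] s = split(new double[]{400e6, 100e6}, 2_097_152);
        System.out.println(Arrays.toString(s)); // prints [1677722, 419430]
    }
}
```

For unequal per-element workloads (the ray-tracing / variable-loop case), the split would need to be re-measured periodically, since a one-time benchmark only captures the average cost per element.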

For now, it looks like:

double[] a = new double[8192*256];
double[] b = new double[8192*256];
for(int i=0; i<a.length; i++)
{
    a[i] = (double)(0.1*i);
    b[i] = 0;
}

 AccContext acc = new AccContext("gpu cpu", new Object[]{a,b}, Kernels.Trigonometry());

 // balanced work
 // here, Sin(a) is written into the b array.
 acc.Compute("Sine"); 

 // b[i] = Cos(Cos(...Cos(Sin(a[i]))...)) for each ith element of the array.
 for(int k=0;k<10;k++)
     acc.Compute("Cosine"); 


 // non-balanced work
 // takes the sine i times for each element a[i]
 acc.Compute("Sine_i_times");

System: Windows 7 64-bit + OpenCL 1.2 capable GPU & CPU, JOCL & 64-bit Java, so I cannot use OpenCL 2.0.

huseyin tugrul buyukisik
  • Is there any reference/info about the `AccContext` and `Kernels.Trigonometry` classes? It's not entirely clear whether the "overhead" that you are talking about comes from the Java side, or from the actual, native `clEnqueueND...` call... – Marco13 Feb 22 '15 at 12:07
  • It comes from both the Java side and the OpenCL native side, but the native side is not optimizable (and is about 0.5 ms per execution). – huseyin tugrul buyukisik Feb 22 '15 at 12:11
  • Show some profiling results, it looks like you have too many variables to distinguish the problem. – Roman Arzumanyan Feb 27 '15 at 16:29

0 Answers