My OpenCL program (don't be scared, this is auto-generated code for 3D CFD) shows strange behavior -- a lot of time are spent in opencl_enq_job_* procedures (opencl_code.c), where are only async OpenCL commands:
clEnqueueWriteBuffer(..,CL_FALSE,...,&event1);
clSetKernelArg(...);
...
clEnqueueNDRangeKernel(...,1,&event1,&event2);
clEnqueueReadBuffer(...,CL_FALSE,...,1,&event2,&event3);
clSetEventCallback(event3,...);
clFlush(...);
In program output the time, spent in opencl_enq_job_* shown as:
OCL waste: 0.60456248727985751
It's mean 60% of time wasted in that procedures.
Most of time (92%) are spent in clEnqueueReadBuffer function and ~5% in clSetEventCallback.
Why so much? What's wrong in this code?
My configuration:
Platform: NVIDIA CUDA
Device 0: Tesla M2090
Device 1: Tesla M2090
Nvidia cuda_6.0.37 SDK and drivers.
Linux localhost 3.12.0 #6 SMP Thu Apr 17 20:21:10 MSK 2014 x86_64 x86_64 x86_64 GNU/Linux
Update: Nvidia accepted this as a bug.
Update1: On my laptop (MBP15, AMD GPU, Apple OpenCL) the program show similar behavior, but waiting more in clFlush (>99%). On CUDA SDK the program works without clFlush, on Apple program without clFlush hangs (submitted tasks never finishes).