OpenCL: too slow async commands submiting

Question

My OpenCL program (don't be scared, this is auto-generated code for 3D CFD) shows strange behavior -- a lot of time are spent in opencl_enq_job_* procedures (opencl_code.c), where are only async OpenCL commands:

clEnqueueWriteBuffer(..,CL_FALSE,...,&event1);
clSetKernelArg(...);
...
clEnqueueNDRangeKernel(...,1,&event1,&event2);
clEnqueueReadBuffer(...,CL_FALSE,...,1,&event2,&event3);
clSetEventCallback(event3,...);
clFlush(...);

In program output the time, spent in opencl_enq_job_* shown as:

OCL waste: 0.60456248727985751

It's mean 60% of time wasted in that procedures.

Most of time (92%) are spent in clEnqueueReadBuffer function and ~5% in clSetEventCallback.

Why so much? What's wrong in this code?

My configuration:

Platform: NVIDIA CUDA
Device 0: Tesla M2090
Device 1: Tesla M2090

Nvidia cuda_6.0.37 SDK and drivers.
Linux localhost 3.12.0 #6 SMP Thu Apr 17 20:21:10 MSK 2014 x86_64 x86_64 x86_64 GNU/Linux

Update: Nvidia accepted this as a bug.

Update1: On my laptop (MBP15, AMD GPU, Apple OpenCL) the program show similar behavior, but waiting more in clFlush (>99%). On CUDA SDK the program works without clFlush, on Apple program without clFlush hangs (submitted tasks never finishes).

Who can say what's wrong with 3k lines of auto-generated code? Did you try smaller programms? Is this behavior specific only to one application, or API calls are generally slow? — Roman Arzumanyan, Jul 29 '14 at 18:59
I have localized a place, where problem are exists, just look at opencl_enq_job_* procedures in opencl_code.c, the procedures are small and contains very simple code. — Pavel, Jul 29 '14 at 19:22
Most likely the call to `clEnqueueReadBuffer` is blocking (most are, since you need the results), so it's parked there waiting for everything async to finish. If you want to be concurrently doing things on the CPU, do it in another thread, or use an event callback to know when the OpenCL work is done. — Dithermaster, Jul 30 '14 at 14:52
clEnqueueReadBuffer is unblocking (invoked with CL_FALSE) and I can see unfinished tasks running in parallel. I have submitted this as a bug to Nvidia, they reproduced and accepted it. — Pavel, Jul 30 '14 at 17:39

score 1 · Accepted Answer · edited Jul 31 '14 at 14:53

1

I have tried memory pining and it significantly improved the situation!

Problem was solved.

I think this is not really a bug; I just missed something in the documentation. My investigation lead me to the conclusion, that driver just cannot perform async load/store of non-pinned buffer -- even if non-blocking calls are used. The driver just waits for an opportunity to store/load data, which can be performed only after task finish, and this breaks parallelism.

edited Jul 31 '14 at 14:53

mfa

5,017
2
23
28

answered Jul 30 '14 at 19:48

Pavel

363
1
2
14

1

It will depend on the driver/HW. For many systems, asyncs do work even on non-pinned memory. BTW: The profiler sometimes assumes the time used is by `clEnqueueRedBuffer()` just because it is the last call pending before it can continue the execution. So, I wouldn't trust any C profiler (they don't supose an underlying HW is blocking the execution), better use the OpenCL profiler events. – DarkZeros Aug 01 '14 at 10:15

OpenCL: too slow async commands submiting

1 Answers1