8

I am a beginner with OpenCL and I am having difficulty understanding something. I want to improve the transfers of an image between host and device. I made a diagram to explain myself better.

Top: what I have now | Bottom: what I want. HtD (Host to Device) and DtH (Device to Host) are memory transfers. K1 and K2 are kernels.

I thought about using mapped memory, but isn't the first transfer (Host to Device) done with the clSetKernelArg() command? Or do I have to cut my input image into sub-images and use mapping to get the output image?

Thanks.

Edit: More information

K1 processes the input image (MemInput). K2 processes the output image from K1.

So, I want to transfer MemInput in several pieces for K1, and I want to read back and save on the host the MemOutput processed by K2.

Alex Placet
  • 567
  • 7
  • 20

4 Answers

7

As you may have already seen, you do a transfer from host to device by using clEnqueueWriteBuffer and similar commands.

All the commands having the keyword 'enqueue' in them have a special property: the commands are not executed directly, but only when you trigger them using clFinish, clFlush, clEnqueueWaitForEvents, by calling clEnqueueWriteBuffer in blocking mode, and in some other ways.

This means that all the action happens at once, and you have to synchronise it using event objects. As everything (may) happen concurrently, you could do something like this (each numbered step happens at the same time):

  1. Transfer Data A
  2. Process Data A & Transfer Data B
  3. Process Data B & Transfer Data C & Retrieve Data A'
  4. Process Data C & Retrieve Data B'
  5. Retrieve Data C'
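A minimal sketch of this pipeline in host code (all names here are illustrative assumptions: `queue`, `kernel`, the ping-pong buffers `bufIn`/`bufOut`, the host arrays, and the sizes; error checking and event releases are omitted):

```c
/* Sketch: ping-pong pipelining with events. Each chunk's kernel
 * waits only on its own upload, so chunk i+1's transfer can
 * overlap chunk i's computation (given suitable queues). */
cl_event wrote[2], ran[2];

for (int i = 0; i < numChunks; ++i) {
    int b = i % 2;  /* alternate between two device buffers */

    /* Non-blocking upload of chunk i into buffer b */
    clEnqueueWriteBuffer(queue, bufIn[b], CL_FALSE, 0, chunkBytes,
                         hostIn + i * chunkBytes, 0, NULL, &wrote[b]);

    /* Kernel gated on its own upload via the event wait list */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufIn[b]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufOut[b]);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
                           1, &wrote[b], &ran[b]);

    /* Non-blocking download of the result, gated on the kernel */
    clEnqueueReadBuffer(queue, bufOut[b], CL_FALSE, 0, chunkBytes,
                        hostOut + i * chunkBytes, 1, &ran[b], NULL);
}
clFinish(queue);
```

Note that on a single in-order queue these commands still serialise; actual overlap requires an out-of-order queue or multiple queues, as the other answers describe.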

Remember: enqueueing tasks without event objects may result in simultaneous execution of all enqueued elements!

To make sure that Process Data B doesn't happen before Transfer B, you have to retrieve an event object from clEnqueueWriteBuffer and supply it as an object to wait for to, e.g., clEnqueueNDRangeKernel:

cl_event evt;
clEnqueueWriteBuffer(... , bufferB , ... , ... , ... , bufferBdata , 0 , NULL , &evt);
clEnqueueNDRangeKernel(... , kernelB , ... , ... , ... , ... , 1 , &evt , NULL);

Instead of supplying NULL, each command can of course wait on certain event objects AND generate a new event object. The second-to-last parameter is an array, so you can even wait for several events!


EDIT: To summarise the comments below: which transfer command acts where?
       CPU                                    GPU
                                        BufA       BufB
array[] = {...}
clCreateBuffer()        ----->   [     ]              //Create (empty) buffer in GPU memory *
clCreateBuffer()        ----->   [     ]    [     ]   //Create (empty) buffer in GPU memory *
clEnqueueWriteBuffer()  -arr->   [array]    [     ]   //Copy from CPU to GPU
clEnqueueCopyBuffer()            [array] -> [array]   //Copy from GPU to GPU
clEnqueueReadBuffer()   <-arr-   [array]    [array]   //Copy from GPU to CPU

* You may initialise the buffer directly by providing data using the host_ptr parameter.
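For instance, creation and the first write can be folded into one step (a sketch; `context` is assumed to exist and error handling is elided):

```c
/* Create a device buffer that is immediately filled with host data,
 * replacing a separate clEnqueueWriteBuffer call. */
float array[100] = { 0 };  /* host data */
cl_int err;
cl_mem bufA = clCreateBuffer(context,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             sizeof(array), array, &err);
```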

Nippey
  • 4,708
  • 36
  • 44
  • Ok. But my problem is the transfer of a buffer (a 2D image in my case). An example: on my host, I have a buffer that represents an array of 100 elements. But I want to send 10 elements at a time to the device and, once that is done, synchronise to start my kernel. I want to avoid the complete transfer of the buffer at the beginning (see scheme). – Alex Placet Sep 12 '12 at 14:01
  • Two of the parameters I just wrote as dots can receive an array of sizes, one sets the range, the other sets the offset of the area you want to copy. I don't have the doc available yet, but I guess that's exactly what you want! ;) – Nippey Sep 12 '12 at 14:53
  • 1
    Like this: `size_t origin[] = {0,0,0}, region[] = {2,2,1}; clEnqueueWriteImage(queue, image, CL_FALSE, origin, region, 0, 0, yourHostData, events_to_wait_for, new_event)` Now change the origin and call it again. This way you can transfer subsets of your buffer. Make sure that you also offset the pointer to `yourHostData` appropriately. (Note: the origin and region have to be 3-dimensional; `region[2]` must be `1` for a 2D image.) – Nippey Sep 13 '12 at 06:19
  • Stack Exchange notifies me about a comment that I can't see here. I'll try to interpret the preview from my inbox ;) If you want to transfer data from the host to the device or vice versa, you use `clEnqueueWrite...` or `clEnqueueRead...`. If you want to transfer data from within the device to another location within the device, you use `clEnqueueCopy...` – Nippey Sep 13 '12 at 08:59
  • yourHostData can't be a cl_mem object from a previous image? I get an error when I try that. And how can I offset that pointer? ` error = clEnqueueWriteImage( commandQueue, input_image, CL_TRUE, origin1, region1, image_row_pitch, image_slice_pitch, loaded_image, 0, NULL, NULL ); ` loaded_image is a cl_mem 2D image. – Alex Placet Sep 13 '12 at 09:03
  • I think I have difficulty understanding how to use the pointer. loaded_image is the cl_mem where the image data are, and input_image the cl_mem used by the kernels. – Alex Placet Sep 13 '12 at 09:16
  • You can't manipulate the pointers of cl-objects, only of host-data-arrays! If you have a cl_image and want to transfer it to another cl_image, you have to use `clEnqueueCopyImage`. And there you do not offset the pointer, you can only change the origin and region parameter. – Nippey Sep 13 '12 at 09:16
  • Maybe you should refine your question. I understood that you have an array of some data on your CPU side and want to transfer it piece by piece to the GPU, but now I understand it as if everything is already on the GPU :) – Nippey Sep 13 '12 at 09:23
  • But `clEnqueueCopyImage` transfers Device to Device? I understand; I will use an array to store the data before transferring it to the device in a cl_mem. Thanks a lot! – Alex Placet Sep 13 '12 at 09:36
  • You understood. At the beginning of my program, I load an image from my hard drive in a cl_mem. I just want to transfer this cl_mem image2D piece by piece to the GPU. – Alex Placet Sep 13 '12 at 09:41
  • If it is in a cl_mem, then it is already on the GPU, as cl_mem objects are always created in the memory of your CL-device – Nippey Sep 13 '12 at 09:50
  • Ok, I think I have found it. So, when I create the 2D image on my host (from data on my hard disk), I don't specify any flag. (1) Then I use clEnqueueCopyImage to copy this cl_mem to another one on my GPU. NVIDIA Visual Profiler displays what I want (see bottom of my scheme). – Alex Placet Sep 13 '12 at 09:51
  • 1
    Sorry, didn't understand this completely. My Workflow: **-#-** Read image from HDD to [integer|char]-Array **-#-** Create Buffer on GPU **-#-** Transfer Array to Buffer (If required, you can do this partwise) **-#-** Work with it. – Nippey Sep 13 '12 at 09:55
  • cl_mem objects are always created in the memory of my GPU? Ok ... I did not understand that; I thought it had to be done explicitly (clEnqueue, setKernelArg, etc ...) – Alex Placet Sep 13 '12 at 09:58
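The piecewise image upload discussed in these comments could be sketched roughly as follows (assumed names: `queue`, `image`, `hostPixels`; a band height `bandH` that divides the image height; RGBA8 pixels):

```c
/* Upload a 2D image band by band: the origin and the host pointer
 * offset advance together, so each call writes the next strip. */
size_t rowPitch = width * 4;  /* bytes per row of host data (RGBA8) */
for (size_t y = 0; y < height; y += bandH) {
    size_t origin[3] = {0, y, 0};
    size_t region[3] = {width, bandH, 1};  /* region[2] must be 1 for 2D */
    clEnqueueWriteImage(queue, image, CL_FALSE, origin, region,
                        rowPitch, 0,
                        hostPixels + y * rowPitch,  /* offset host pointer */
                        0, NULL, NULL);
}
clFinish(queue);
```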
3

Many OpenCL platforms don't support out-of-order command queues; the way most vendors say to do overlapped DMA and compute is to use multiple (in-order) command queues. You can use events to ensure dependencies are done in the right order. NVIDIA has example code that shows overlapped DMA and compute doing it this way (although it is suboptimal; it can go slightly faster than they say it can).

Dithermaster
  • 6,223
  • 1
  • 12
  • 20
  • Thanks for your answer. In my case, the OpenCL kernel computation is short. The data transfer takes a lot of time, and this is where I want to gain time. I tried this on an NVIDIA GTX 260, but this graphics card does not support transfer/transfer overlap (it is only capable of compute/transfer overlap). In the NVIDIA OpenCL Best Practices Guide, we can read: "NVIDIA devices with compute capability >= 2.0 possess 2 independent copy engines and are capable of simultaneous copy in 2 directions concurrent with device computation." And the GTX 260's compute capability is 1.3 ... – Alex Placet Oct 15 '12 at 06:51
  • [Dithermaster](https://stackoverflow.com/users/1745695/dithermaster) could you point to an example of a working implementation of the approach that you are mentioning? – user3116936 Oct 11 '18 at 08:18
  • See "OpenCL Overlapped Copy/Compute Sample" example on this page: https://developer.nvidia.com/opencl – Dithermaster Oct 11 '18 at 22:36
2

When you create your command queue, you need to enable out-of-order execution in its properties; see CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE in the clCreateCommandQueue documentation.
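A sketch of creating such a queue, with a fallback for devices that reject the property (further error handling omitted; `context` and `device` assumed):

```c
/* Request an out-of-order queue; dependencies must then be
 * expressed explicitly through events. */
cl_int err;
cl_command_queue queue = clCreateCommandQueue(
    context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
if (err == CL_INVALID_QUEUE_PROPERTIES) {
    /* Not every device supports out-of-order execution;
     * fall back to a plain in-order queue (or multiple queues). */
    queue = clCreateCommandQueue(context, device, 0, &err);
}
```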

This will let you set up your smaller chains of tasks and link them to each other. This is all done on the host.

host pseudo code:

for i in taskChainList
  enqueueWriteDataFromHost
  enqueueKernel(K1)
  enqueueKernel(K2)
  enqueueReadFromDevice
clFinish

When you are queueing the tasks, put the previous cl_event into each task's event_wait_list. The 'enqueueWriteDataFromHost' I have above wouldn't have to wait for another event to begin.

Alternately,

cl_event prevWriteEvent;
cl_event newWriteEvent;
for i in taskChainList
  enqueueWriteDataFromHost // pass *prevWriteEvent as the event_wait_list, and update with newWriteEvent that the enqueue function produces. Now each Write will wait on the one before it.
  enqueueKernel(K1)
  enqueueKernel(K2)
  enqueueReadFromDevice  //The reads shouldn't come back out of order, but they could (if the last block of processing were much faster than the 2nd-last, for example)
clFinish
mfa
  • 5,017
  • 2
  • 23
  • 28
2

The proper way (as I do it, and it works perfectly) is to create 2 command queues, one for I/O and another for processing. Both must be in the same context.

You can use events to control the scheduling of both queues, and the operations will execute in parallel (if they can). It does indeed work even if the device does not support out-of-order queues.

For example, you can enqueue the uploads of all 100 images to the GPU in the I/O queue and collect their events. Then set these events as triggers for the kernels, and let the DtoH transfers be triggered by the kernel events. Even if you enqueue all these jobs AT ONCE, they will be processed in order and with parallel I/O.
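A rough sketch of this two-queue setup (the buffer arrays, sizes, and kernel are assumed to exist; events are not released for brevity):

```c
/* Two in-order queues in the same context: one for transfers,
 * one for kernels. Events link the two queues together. */
cl_command_queue ioQ   = clCreateCommandQueue(context, device, 0, &err);
cl_command_queue calcQ = clCreateCommandQueue(context, device, 0, &err);

cl_event uploaded[N], computed[N];
for (int i = 0; i < N; ++i) {
    /* All uploads go to the I/O queue... */
    clEnqueueWriteBuffer(ioQ, bufIn[i], CL_FALSE, 0, size,
                         hostIn[i], 0, NULL, &uploaded[i]);

    /* ...each kernel on the compute queue waits for its upload... */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufIn[i]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufOut[i]);
    clEnqueueNDRangeKernel(calcQ, kernel, 1, NULL, &globalSize, NULL,
                           1, &uploaded[i], &computed[i]);

    /* ...and each DtoH transfer waits for its kernel. */
    clEnqueueReadBuffer(ioQ, bufOut[i], CL_FALSE, 0, size,
                        hostOut[i], 1, &computed[i], NULL);
}
clFinish(ioQ);
clFinish(calcQ);
```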

DarkZeros
  • 8,235
  • 1
  • 26
  • 36