Strided copy between HOST and DEVICE clEnqueueWriteBufferRect

Question

I am looking for a way to transfer data from two host buffers into a single device buffer in the following strided way:

Below are the two host buffers:

Host_buffer_1 = [0 5] // copy to device with a stride equal to 5
Host_buffer_2 = [1 2 3 4 6 7 8 9] // each group of 4 numbers is copied with a stride

I need the resulting device buffer to be [0 1 2 3 4 5 6 7 8 9].

Do I have to build this layout on the host first and then do a normal transfer to the device, or is there a way to achieve it directly, for instance with clEnqueueWriteBufferRect? As far as I can see, that function does not have a stride parameter, right?

Thanks

DarkZeros · Answer 1 · 2016-03-23T18:10:26.633

You can use OpenCL calls to do the rectangular patch copy on the fly. However, performance-wise, I am not sure this is the right approach.

If you shape your data as a 2D array:

0 1 2 3 4
5 6 7 8 9

Then the buffers map like this (1 marks elements coming from Host1, 2 marks elements coming from Host2):

Device         Host1  Host2
1 2 2 2 2      1      2 2 2 2
1 2 2 2 2      1      2 2 2 2

Therefore the copy rect commands should be:

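// The x components of the origins and of the region are in bytes;
// the y and z components count rows and slices.
size_t origin0[3] = {0, 0, 0};
size_t origin1[3] = {1*sizeof(type), 0, 0};  //skip the first column
size_t region1[3] = {1*sizeof(type), 2, 1};  //1 element wide, 2 rows
size_t region2[3] = {4*sizeof(type), 2, 1};  //4 elements wide, 2 rows

clEnqueueWriteBufferRect(queue, buffer, CL_FALSE,
                     origin0,  //buffer_origin
                     origin0,  //host_origin
                     region1,  //region
                     5*sizeof(type),  //buffer_row_pitch
                     0,  //buffer_slice_pitch
                     1*sizeof(type),  //host_row_pitch
                     0,  //host_slice_pitch
                     host1, 0, NULL, NULL);

clEnqueueWriteBufferRect(queue, buffer, CL_FALSE,
                     origin1,  //buffer_origin
                     origin0,  //host_origin
                     region2,  //region
                     5*sizeof(type),  //buffer_row_pitch
                     0,  //buffer_slice_pitch
                     4*sizeof(type),  //host_row_pitch
                     0,  //host_slice_pitch
                     host2, 0, NULL, NULL);

// CL_FALSE makes both writes non-blocking: leave host1 and host2
// untouched until the queue has finished (e.g. after clFinish(queue)).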

But be VERY careful with the row_pitch and slice_pitch values, as well as the origins and regions, since it is quite easy to get them wrong. (And please check my code if you use it.)
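One way to check the pitches and origins without a device is to replay the addressing on plain host arrays. The sketch below assumes the 2D case with a single slice and treats origin[0] and region[0] as byte counts, which is how the OpenCL specification defines them. The helpers emulate_write_rect and replay_answer are hypothetical names made up for this illustration, not OpenCL calls, and the element type is assumed to be int.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only emulation of the addressing used by a 2D rectangular write.
   origin[0] and region[0] are byte counts; origin[1] and region[1] are
   row counts. Hypothetical helper for checking parameters, not an OpenCL call. */
void emulate_write_rect(unsigned char *buffer, const unsigned char *host,
                        const size_t buffer_origin[3],
                        const size_t host_origin[3],
                        const size_t region[3],
                        size_t buffer_row_pitch, size_t host_row_pitch)
{
    assert(region[2] == 1);  /* this sketch handles a single slice only */
    for (size_t row = 0; row < region[1]; ++row) {
        size_t dst = (buffer_origin[1] + row) * buffer_row_pitch + buffer_origin[0];
        size_t src = (host_origin[1] + row) * host_row_pitch + host_origin[0];
        memcpy(buffer + dst, host + src, region[0]);
    }
}

/* Replays the two rectangular writes on plain arrays; out must hold 10 ints. */
void replay_answer(int out[10])
{
    const int host1[2] = {0, 5};
    const int host2[8] = {1, 2, 3, 4, 6, 7, 8, 9};
    const size_t zero[3]    = {0, 0, 0};
    const size_t col1[3]    = {1 * sizeof(int), 0, 0};  /* skip first column */
    const size_t region1[3] = {1 * sizeof(int), 2, 1};  /* 1 wide, 2 rows */
    const size_t region2[3] = {4 * sizeof(int), 2, 1};  /* 4 wide, 2 rows */

    memset(out, 0xFF, 10 * sizeof(int));  /* poison so gaps would be visible */
    emulate_write_rect((unsigned char *)out, (const unsigned char *)host1,
                       zero, zero, region1, 5 * sizeof(int), 1 * sizeof(int));
    emulate_write_rect((unsigned char *)out, (const unsigned char *)host2,
                       col1, zero, region2, 5 * sizeof(int), 4 * sizeof(int));
}
```

Calling replay_answer on a 10-element int array and comparing the result against 0..9 shows whether a given set of origins, regions and pitches covers every element exactly once.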

clEnqueueWriteBufferRect

A low-end (or an old mainstream) AMD card has only 2 queues for memory transfers, so copying only a few items per clEnqueue call would be suboptimal. Why don't you try a kernel that reads with a stride from a USE_HOST_PTR buffer and writes to a device-side buffer (or the opposite: read linear, write strided)? At least that would be a single command from the host and would minimize latency. I tried a divide-and-conquer approach using rectangle copies to feed a GPU, but a simple kernel was faster, at least on Windows 10 with Catalyst 16.x.y and an HD7000 series card. - huseyin tugrul buyukisik, Mar 23 '16 at 11:57
You are right, this is not optimal. But I was guessing this is just a simplified example. In any case, I would go for separate buffers and coalesced access, rather than mixing data in a single buffer. - DarkZeros, Mar 23 '16 at 12:00
But if you have a Fury X, then its async engines may be useful for many copies. - huseyin tugrul buyukisik, Mar 23 '16 at 12:56
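The kernel idea from the comments could look roughly like the sketch below. It is an illustration under assumptions, not a tested device implementation: the kernel name interleave, its argument names, and the int element type are all made up here, and the kernel source is only carried as a string. The C function interleave_reference is a hypothetical host-side reference that applies the same index mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical OpenCL C kernel source: one work-item per output element.
   heads holds the first element of each row, tails holds the rest. */
const char *const interleave_kernel_src =
    "__kernel void interleave(__global const int *heads,\n"
    "                         __global const int *tails,\n"
    "                         __global int *out, int stride)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    int row = i / stride, col = i % stride;\n"
    "    out[i] = (col == 0) ? heads[row]\n"
    "                        : tails[row * (stride - 1) + col - 1];\n"
    "}\n";

/* Host-side reference of the same mapping: out receives rows * stride
   elements, with heads[row] in column 0 and tails filling the other columns. */
void interleave_reference(const int *heads, const int *tails, int *out,
                          size_t rows, size_t stride)
{
    assert(stride > 1);  /* each row needs at least one tail element */
    for (size_t i = 0; i < rows * stride; ++i) {
        size_t row = i / stride, col = i % stride;
        out[i] = (col == 0) ? heads[row]
                            : tails[row * (stride - 1) + col - 1];
    }
}
```

The same loop also covers the option raised in the question: run it on the host to fill a staging array, then send that array with a single plain clEnqueueWriteBuffer call.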
