How do I make a strided copy from global to local memory?

Question

I want to copy some data from a buffer in the global device memory to the local memory of a processing core - but, with a twist.

I know about async_work_group_copy, and it's nice (or rather, it's klunky and annoying, but working). However, my data is not contiguous - it is strided, i.e. there might be X bytes between every two consecutive Y bytes I want to copy.

Obviously I'm not going to copy all the useless data - and it might not even fit in my local memory. What can I do instead? I want to avoid writing actual kernel code to do the copying, e.g.

size_t threadId = get_local_id(0);
if (threadId < length) {
    size_t offset = threadId * stride;
    localData[threadId] = globalData[offset];
}

@AndreasNiedermair: A valid comment, but I did ask that question 3 years ago already. — einpoklum, Sep 13 '16 at 13:40

score 4 · Accepted Answer · edited Sep 13 '16 at 13:06

You can use async_work_group_strided_copy(), which is a built-in function of the OpenCL C kernel language.
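Note that the stride argument is counted in elements of the copied type, not in bytes, and it is the distance between the starts of consecutive elements. The question describes the layout as a gap of X bytes between chunks of Y bytes, so here is a minimal sketch of the conversion, assuming each chunk is exactly one element of the copied type. The helper name element_stride is hypothetical and is not part of OpenCL or pyopencl.

```python
def element_stride(gap_bytes, element_bytes):
    """Convert 'gap_bytes between consecutive elements' into a stride in elements.

    Hypothetical helper: assumes each wanted chunk is exactly one element and
    that the gap is a whole number of elements.
    """
    if element_bytes <= 0:
        raise ValueError("element_bytes must be positive")
    if gap_bytes < 0 or gap_bytes % element_bytes != 0:
        raise ValueError("gap must be a non-negative multiple of the element size")
    # Distance from the start of one element to the start of the next, in elements.
    return (gap_bytes + element_bytes) // element_bytes


# RGB stored as 4-byte ints: 8 bytes (G and B) separate consecutive R values.
print(element_stride(8, 4))  # 3
```

If the gap is not a multiple of the element size, this approach does not apply directly, and the copy would have to be expressed with a different element type or written by hand.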

Here is a small example in pyopencl, thanks to @DarkZeros' comment. Let's assume a small strip of an RGB image, say 4 by 1, like this:

img = np.array([58, 83, 39, 157, 190, 199, 64, 61, 5, 214, 141, 6], dtype=np.int32)

and you want to access the four red-channel values, i.e. [58 157 64 214]. You'd do:

def test_async_copy_stride_to_local(self):
    # Create context, queue, program first (mf is assumed to be cl.mem_flags)
    ....
    # number of R channels
    nb_of_el = 4
    # int32 so that the element size matches the kernel's `int`
    img = np.array([58, 83, 39, 157, 190, 199, 64, 61, 5, 214, 141, 6], dtype=np.int32)
    cl_input = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=img)
    # buffer used to check if the copy is correct
    cl_output = cl.Buffer(ctx, mf.WRITE_ONLY, size=nb_of_el * np.dtype('int32').itemsize)
    lcl_buf = cl.LocalMemory(nb_of_el * np.dtype('int32').itemsize)
    # local size == global size: one work-group, because the kernel indexes
    # local memory with the global id
    prog.asynCopyToLocalWithStride(queue, (nb_of_el,), (nb_of_el,), cl_input, cl_output, lcl_buf)
    result = np.zeros(nb_of_el, dtype=np.int32)
    cl.enqueue_copy(queue, result, cl_output).wait()
    print(result)

The kernel:

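kernel void asynCopyToLocalWithStride(global int *in, global int *out, local int *localBuf){
    // usable as a local index only if the whole range runs as a single work-group
    const int idx = get_global_id(0);
    localBuf[idx] = 0;
    // make sure the zeroing is finished before the copy writes to localBuf
    barrier(CLK_LOCAL_MEM_FENCE);
    // copy 4 elements, the stride = 3 (RGB); every work-item must make this call
    event_t ev = async_work_group_strided_copy(localBuf, in, 4, 3, 0);
    // block until the copy has completed
    wait_group_events(1, &ev);
    out[idx] = localBuf[idx];
}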
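To check what the kernel should write to out, the global-to-local strided copy can be modelled on the host as dst[i] = src[i * src_stride]. The sketch below is plain Python and only mirrors the indexing; the function name strided_gather is hypothetical, and it is not how the device performs the copy.

```python
def strided_gather(src, num_elements, src_stride):
    """Host-side model of a global-to-local strided copy: dst[i] = src[i * src_stride]."""
    if src_stride < 1:
        raise ValueError("src_stride must be at least 1")
    # Index of the last element that would be read from src.
    last_index = (num_elements - 1) * src_stride
    if num_elements > 0 and last_index >= len(src):
        raise IndexError("strided read would run past the end of src")
    return [src[i * src_stride] for i in range(num_elements)]


img = [58, 83, 39, 157, 190, 199, 64, 61, 5, 214, 141, 6]
expected = strided_gather(img, 4, 3)  # 4 elements, stride 3 (RGB)
print(expected)  # [58, 157, 64, 214]
```

Comparing this list with the result array read back from cl_output is a quick way to confirm that the element count and stride passed in the kernel are the ones intended.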

Thank you, a detailed example like yours is missing in the official doc of the function. — DarkZeros, Aug 05 '13 at 15:19
