I want to copy some data from a buffer in global device memory to the local memory of a processing core, but with a twist.
I know about async_work_group_copy, and it's nice (or rather, it's clunky and annoying, but it works). However, my data is not contiguous: it is strided, i.e. each run of Y bytes I want to copy may be separated from the next run by X bytes I don't need.
Obviously I'm not going to copy all the useless data along with it; the whole range might not even fit in my local memory. What can I do instead? I want to avoid writing actual kernel code to do the copying, e.g.:
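// Each work-item copies one strided element into local memory.
size_t threadId = get_local_id(0);
if (threadId < length) {
    size_t offset = threadId * stride;  // stride is counted in elements here
    localData[threadId] = globalData[offset];
}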
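To make the layout concrete, here is a plain C, host-side reference of the gather I'm after. It is only a sketch of the intended result, not device code: the function name `strided_gather` and the parameters `count`, `chunk` (the Y bytes I want) and `gap` (the X bytes I want to skip) are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reference helper: pack `count` chunks of `chunk` bytes each
 * from `src` into a contiguous `dst`, skipping `gap` unwanted bytes after
 * every chunk. `dst` must hold at least count * chunk bytes. */
static void strided_gather(unsigned char *dst, const unsigned char *src,
                           size_t count, size_t chunk, size_t gap)
{
    for (size_t i = 0; i < count; ++i) {
        /* Chunk i starts i * (chunk + gap) bytes into the source buffer. */
        memcpy(dst + i * chunk, src + i * (chunk + gap), chunk);
    }
}
```

With `chunk` equal to the size of one element, this matches the per-work-item snippet above, where `stride` would be `(chunk + gap)` divided by the element size.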