
I would like to understand how to correctly use the async_work_group_copy() call in OpenCL. Let's look at a simplified example:

__kernel void test(__global float *x) {
  __local float xcopy[GROUP_SIZE];

  int globalid = get_global_id(0);
  int localid = get_local_id(0);
  event_t e = async_work_group_copy(xcopy, x+globalid-localid, GROUP_SIZE, 0);
  wait_group_events(1, &e);
}

The reference http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/async_work_group_copy.html says "Perform an async copy of num_elements gentype elements from src to dst. The async copy is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a workgroup executing the kernel with the same argument values; otherwise the results are undefined."

But that doesn't answer my questions...

I would like to know, if the following assumptions are correct:

  1. The call to async_work_group_copy() must be executed by all work-items in the group.
  2. The call should be made such that the source address is identical for all work-items and points to the first element of the memory area to be copied.
  3. Since my source address is based on each work-item's global id, I have to subtract the local id so that the address is identical for all work-items and points to the first element handled by the work-group...
  4. Is the third parameter really the number of elements (not the size in bytes)?

Bonus questions:

a. Can I just use barrier(CLK_LOCAL_MEM_FENCE) instead of wait_group_events() and ignore the return value? If so, would that probably be faster?

b. Does a local copy also make sense for processing on CPUs, or is that just overhead, since they share a cache anyway?

Regards, Stefan

SDwarfs

1 Answer


One of the main reasons for this function existing is to allow the driver/kernel compiler to efficiently copy the memory without the developer having to make assumptions about the hardware.

You describe what memory you need copied as if it were a single-threaded copy, and async_work_group_copy gets it done for you using the parallel hardware.

For your specific questions:

  1. I have never seen async_work_group_copy used by only some of the work items in a group. I always assumed this is because it is required. I think the blocking nature of wait_group_events forces all work items to be part of the copy.

  2. Yes. Source (and destination) addresses need to be the same for all work items.

  3. You could subtract your local id to get the correct address, but I find that basing the address on the group id (get_group_id) solves this problem as well.

  4. Yes. The last param is the number of elements, not the size in bytes.
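To illustrate point 3, here is a minimal sketch of the question's kernel with the source address derived from get_group_id instead of global minus local id. It assumes GROUP_SIZE is a compile-time constant matching the work-group size chosen on the host:

__kernel void test(__global const float *x) {
  __local float xcopy[GROUP_SIZE];

  // Identical for every work-item in the group, as required:
  const __global float *src = x + get_group_id(0) * GROUP_SIZE;

  event_t e = async_work_group_copy(xcopy, src, GROUP_SIZE, 0);
  wait_group_events(1, &e);

  // xcopy may now be read safely by all work-items in the group.
}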

a. No. If you replace the event-based wait with a barrier, you will find that the barrier is hit almost immediately by the work items, but the data won't necessarily have been copied yet. This makes sense because some OpenCL hardware might not even use the compute units at all to do the actual copy operation.
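For completeness, a sketch of the full round-trip pattern (the kernel name and the scaling operation are illustrative, not from the question): wait on the copy-in event before reading the local buffer, use a barrier between the local writes and the copy-out, and wait on the copy-out event before the global data can be assumed valid:

__kernel void scale(__global float *x, float factor) {
  __local float tmp[GROUP_SIZE];
  const int base = get_group_id(0) * GROUP_SIZE;

  event_t in = async_work_group_copy(tmp, x + base, GROUP_SIZE, 0);
  wait_group_events(1, &in);      // data is guaranteed present only after this

  tmp[get_local_id(0)] *= factor;
  barrier(CLK_LOCAL_MEM_FENCE);   // local writes must be visible before copy-out

  event_t out = async_work_group_copy(x + base, tmp, GROUP_SIZE, 0);
  wait_group_events(1, &out);
}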

b. I think that CPU OpenCL implementations might guarantee L1 cache usage when you use local memory. The only way to know for sure whether this performs better is to benchmark your application with various settings.

mfa
  • Thank you! Regarding 3. get_group_id() - nice hint, but in this special case it would need a multiplication by the size of the work-group; even for known work-group sizes of 2^n this could be sped up by a shift-left... However, in other cases this might be very useful to know it also exists! Thanks for sharing your experience! --- Stefan – SDwarfs Mar 21 '13 at 15:05
  • I have a question regarding your answer to "a": As far as I understand, the copying should be processed by at least one work-item. As all other nodes in the work-group need the local data to be ready, it makes sense to me, that they all wait for the data to be available. Waiting for the event should do exactly that. But a barrier should ensure the same, as all other work-items of the group would wait for the work-items that are engaged in the copying process. And a barrier might be simpler (at most 1 barrier per work group vs. at most 1 event wait per work item) and thus faster. – SDwarfs Mar 21 '13 at 15:21
  • re: 'a'... There could exist a device in the future that handles the async copy without using any work items. The barrier would work if at least one work item were held up by the copy operation. – mfa Mar 21 '13 at 18:31
  • You can still write your own loop to do the memory copy; this would be barrier-friendly, and allow you to do some light processing along with the copy (matrix transpose, for example). – mfa Mar 21 '13 at 18:33
  • Async copy without activity in work-items? Doesn't make sense at first glance. BUT on second glance it -does-! You're totally right! That's the kind of answer I like! There seems to be real experience behind it and a feeling for changes in the future. Matrix transpose during copy? Hm, interesting idea too... But for now I probably have to cope with the more basic stuff ;-) – SDwarfs Mar 21 '13 at 18:45
  • Here is an interesting blog post about the async_work_group_copy() method: http://streamcomputing.eu/blog/2014-06-19/using-async_work_group_copy-on-2d-data/ – Jacko Jul 17 '16 at 20:53
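The manual-copy alternative mfa mentions in the comments can be sketched as follows (the kernel name is illustrative; the strided loop also handles work-group sizes smaller than GROUP_SIZE):

__kernel void test_manual(__global const float *x) {
  __local float xcopy[GROUP_SIZE];
  const __global float *src = x + get_group_id(0) * GROUP_SIZE;

  // Each work-item copies a strided subset of the elements...
  for (int i = get_local_id(0); i < GROUP_SIZE; i += get_local_size(0))
    xcopy[i] = src[i];

  // ...and the barrier makes all local writes visible group-wide.
  barrier(CLK_LOCAL_MEM_FENCE);
}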