
I'm trying to copy data from global memory to local memory in OpenCL.

I use async_work_group_copy to do the copy:

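__local float gau2_sh[1024];
__local float gau4_sh[256];
event_t tevent = (event_t)0;
// Passing tevent back in makes both copies share the same event.
tevent = async_work_group_copy(gau2_sh, GAU2, 1024, tevent);
tevent = async_work_group_copy(gau4_sh, GAU4, 256, tevent);
// Only one event exists, so the count must be 1, not 2.
wait_group_events(1, &tevent);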

GAU2 occupies 1024 * 4 bytes of global memory. When I use fewer than 128 threads, it works fine. But if I use more than 128 threads, the kernel fails with the error CL_INVALID_WORK_GROUP_SIZE.

My GPU is an Adreno 420, whose maximum work-group size is 1024.

Do I need to consider anything else when copying to local memory?

user703016
eclipse0922

1 Answer

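The error is caused by the kernel's register usage and local memory usage. The largest work-group size a particular kernel can run with (CL_KERNEL_WORK_GROUP_SIZE, queried with clGetKernelWorkGroupInfo) depends on the resources that kernel needs, so it can be lower than the device-wide maximum.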
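
On the host side, one way to avoid the error is to clamp the requested work-group size to the per-kernel limit before enqueueing. The following is a minimal sketch, not the asker's actual host code: the helper names clamp_local_size and round_up_global are hypothetical, and kernel_max is assumed to hold the value returned by clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp a requested work-group size to the per-kernel limit.
 * kernel_max is assumed to come from
 * clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, ...),
 * which can be lower than CL_DEVICE_MAX_WORK_GROUP_SIZE. */
static size_t clamp_local_size(size_t requested, size_t kernel_max)
{
    assert(requested > 0 && kernel_max > 0);
    return requested < kernel_max ? requested : kernel_max;
}

/* Round the global size up to a multiple of the local size.
 * OpenCL 1.x rejects a global_work_size that is not evenly
 * divisible by local_work_size with CL_INVALID_WORK_GROUP_SIZE. */
static size_t round_up_global(size_t work_items, size_t local_size)
{
    assert(local_size > 0);
    return (work_items + local_size - 1) / local_size * local_size;
}
```

For example, if the query reported 128 for this kernel (an illustrative value matching the behavior described in the question), a request for 256 would be clamped to 128, and 1000 work-items would be padded to a global size of 1024. The kernel then needs a bounds check so the padding work-items do nothing.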

Similar to the -cl-nv-maxrregcount=<N> option of NVIDIA's OpenCL compiler, the Qualcomm Adreno series has a compile option for reducing register usage.
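
Options of this kind are passed to the compiler as a string in the options argument of clBuildProgram. As a rough sketch, the helper below (format_maxrregcount, a hypothetical name) only formats the NVIDIA option quoted above. The Adreno-specific option is deliberately not shown, since it is documented only in the SDK.

```c
#include <stdio.h>
#include <stddef.h>

/* Write "-cl-nv-maxrregcount=<N>" into buf.
 * Returns 0 on success, -1 if formatting fails or buf is too small.
 * The resulting string is meant for the `options` argument of
 * clBuildProgram on NVIDIA's OpenCL implementation; other vendors
 * use their own option names. */
static int format_maxrregcount(char *buf, size_t buf_size, unsigned max_regs)
{
    int n = snprintf(buf, buf_size, "-cl-nv-maxrregcount=%u", max_regs);
    return (n > 0 && (size_t)n < buf_size) ? 0 : -1;
}
```

Lowering the register limit may let more work-items fit in a work-group, at the cost of possible register spilling, so it is worth measuring before and after.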

The official documentation on this is proprietary, so if you need the details, please read the document included in the Qualcomm Adreno SDK.

For the details, please refer to the following links:

  1. Using a barrier causes a CL_INVALID_WORK_GROUP_SIZE error
  2. Questions about global and local work size
  3. Qualcomm Forums - Strange Behavior With OpenCL on Adreno 320
  4. Mobile Gaming & Graphics (Adreno) Tools and Resources
eclipse0922
    Oh, sorry, I had the answer when I edited your question and forgot to reply :( Upvoted. You can accept your own answer by clicking the tick on the left. – user703016 May 14 '15 at 11:00