I'm trying to copy data from global memory to local memory in OpenCL.
I use the async_work_group_copy function for the copy:
__local float gau2_sh[1024];
event_t tevent = (event_t)0;
__local float gau4_sh[256];
tevent = async_work_group_copy(gau2_sh, GAU2, 1024, tevent);
tevent = async_work_group_copy(gau4_sh, GAU4, 256, tevent);
wait_group_events(1, &tevent); // both copies share the single event tevent
The global buffer GAU2 is 1024 * 4 bytes in size. When I use fewer than 128 threads, it works fine. But when I use more than 128 threads, the kernel fails with the error CL_INVALID_WORK_GROUP_SIZE.
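As a sanity check on the sizes involved, the local memory that the two buffers occupy can be worked out with a small sketch like the one below. It assumes a 4-byte float (as in OpenCL C), and the helper names local_bytes and total_local_bytes are hypothetical, not part of any OpenCL API.

```c
#include <stddef.h>

/* Hypothetical helper: bytes occupied by `count` floats,
   assuming a 4-byte float as in OpenCL C. */
static size_t local_bytes(size_t count)
{
    return count * 4u;
}

/* Total __local footprint of gau2_sh[1024] and gau4_sh[256]. */
static size_t total_local_bytes(void)
{
    return local_bytes(1024) + local_bytes(256);
}
```

Under that assumption, the two arrays take 4096 + 1024 = 5120 bytes of local memory per work-group.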
My GPU is an Adreno420, where the maximum work group size is 1024.
Do I need to consider anything else when copying to local memory?