
I'm working with a 3D array of dimensions xdim=49, ydim=1024 and zdim=64. My CL_DEVICE_MAX_WORK_ITEM_SIZES is only 512/512/512. If I declare

size_t global_work_size[3] = {xdim, ydim, zdim};

and launch a 3D kernel, I get wrong results because ydim > 512. If all my dimensions are below 512, I get the expected results. Is there an alternative for this?

Avis

2 Answers


Assuming the dimensions you provided are the size of your data, you can decrease the global work size by making each GPU thread process more data. Right now every thread does one calculation; if you change your kernel to do, say, 2 calculations in the y dimension, then you can cut the number of threads you launch in that dimension in half. The global_work_size decides how many threads you execute in each direction. Let me give you an example:

Let's assume you have an array you want to do some calculations with and the array size you have is 2048. If you write your kernel in the following way, you are going to need 2048 as the global_work_size:

__kernel void calc (__global int *A, __global int *B)
{
  int i = get_global_id(0);
  B[i] = A[i] * 5;
}

The global work size in this case will be:

size_t global_work_size[3] = {2048, 1, 1};

However, if you change your kernel to the following one, you can lower your global work size as well:

__kernel void new_calc (__global int *A, __global int *B)
{
  int i = get_global_id(0);
  for (int ind = 0; ind < 8; ind++)
    B[i*8 + ind] = A[i*8 + ind] * 5;
} 

This way, you can use a global size of:

size_t global_work_size[3] = {256, 1, 1};

With the second kernel, each of your threads also does more work, which can improve utilisation.
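To connect this back to the sizes in the question, here is a rough CPU-side sketch in plain C of the same idea applied to the y dimension. It only emulates the index arithmetic; it is not OpenCL code. The names coarsened_item and run_coarsened, the factor Y_PER_ITEM = 2 and the x-fastest memory layout are all assumptions made for illustration, and gx, gy, gz stand in for get_global_id(0), get_global_id(1) and get_global_id(2).

```c
#include <assert.h>
#include <stddef.h>

/* Dimensions from the question; Y_PER_ITEM is an assumed coarsening factor. */
enum { XDIM = 49, YDIM = 1024, ZDIM = 64, Y_PER_ITEM = 2 };

/* Hypothetical stand-in for the kernel body. gx, gy, gz play the role of
   get_global_id(0), get_global_id(1), get_global_id(2). Each work-item
   handles Y_PER_ITEM consecutive rows in y. */
static void coarsened_item(size_t gx, size_t gy, size_t gz,
                           const int *in, int *out)
{
    for (size_t k = 0; k < Y_PER_ITEM; k++) {
        size_t y = gy * Y_PER_ITEM + k;
        /* Assumed layout: x varies fastest, then y, then z. */
        size_t idx = (gz * YDIM + y) * XDIM + gx;
        out[idx] = in[idx] * 5;
    }
}

/* CPU emulation of a launch with global_work_size = {49, 512, 64}. */
static void run_coarsened(const int *in, int *out)
{
    assert(YDIM % Y_PER_ITEM == 0);  /* y must split evenly across items */
    const size_t global[3] = { XDIM, YDIM / Y_PER_ITEM, ZDIM };
    for (size_t gz = 0; gz < global[2]; gz++)
        for (size_t gy = 0; gy < global[1]; gy++)
            for (size_t gx = 0; gx < global[0]; gx++)
                coarsened_item(gx, gy, gz, in, out);
}
```

With Y_PER_ITEM = 2 the y extent of the launch drops from 1024 to 512 while every element of the 49 x 1024 x 64 array is still written exactly once.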

parallel highway

CL_DEVICE_MAX_WORK_ITEM_SIZES only limits the size of work groups, not the global work item size (yes, it's a terrible name for the constant). You are much more tightly restricted by CL_DEVICE_MAX_WORK_GROUP_SIZE, which is the total number of work items allowed in a work group (you'd typically hit this far sooner than CL_DEVICE_MAX_WORK_ITEM_SIZES because the per-dimension sizes are multiplied together).
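As a rough illustration of which sizes those two limits apply to, here is a small plain-C sketch of the checks. The function name local_size_ok is hypothetical, and the max_group_size of 512 used in the usage note is an assumed value, not one reported in the question. The divisibility check reflects the OpenCL 1.x requirement that the global size be a multiple of the local size in each dimension; devices that support non-uniform work groups may relax it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if `local` is an acceptable work-group
   size for `global`, given the two device limits discussed above.
   max_item_sizes models CL_DEVICE_MAX_WORK_ITEM_SIZES and
   max_group_size models CL_DEVICE_MAX_WORK_GROUP_SIZE. */
static bool local_size_ok(const size_t global[3], const size_t local[3],
                          const size_t max_item_sizes[3],
                          size_t max_group_size)
{
    assert(max_group_size > 0);
    size_t total = 1;
    for (int d = 0; d < 3; d++) {
        /* Per-dimension limit applies to the work-group, not to global. */
        if (local[d] == 0 || local[d] > max_item_sizes[d])
            return false;
        /* OpenCL 1.x: global size must be a multiple of the local size. */
        if (global[d] % local[d] != 0)
            return false;
        total *= local[d];
    }
    /* The product of the local sizes is what usually hits the limit first. */
    return total <= max_group_size;
}
```

Under these assumptions, a global size of {49, 1024, 64} with a local size of {7, 8, 8} passes (448 items per group), while {1, 1024, 1} fails the per-dimension limit and {7, 16, 8} fails the group-size limit. Note that the global size itself is never compared against either limit.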

So go ahead and launch your global work size of 49, 1024, 64. It should work. If it doesn't, you're using get_local_id instead of get_global_id, or you have some other bug. We regularly launch 2D kernels with a 4096 x 4096 global work size.
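To show why mixing up the two IDs gives wrong results only for larger sizes, here is a minimal CPU model in plain C of one NDRange dimension. The names model_global_id and count_covered are hypothetical, the local size of 256 in the usage note is an assumed value, and a zero global offset is assumed. The relation used is the standard one: global ID = group ID * local size + local ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Models get_global_id(d) for one dimension, assuming no global offset. */
static size_t model_global_id(size_t group_id, size_t local_size,
                              size_t local_id)
{
    return group_id * local_size + local_id;
}

/* Counts how many distinct elements of a ydim-long array get written when
   the kernel indexes with the local ID (use_local_id != 0) or with the
   global ID (use_local_id == 0). Returns 0 if allocation fails. */
static size_t count_covered(size_t ydim, size_t local_size, int use_local_id)
{
    assert(local_size > 0 && ydim % local_size == 0);
    unsigned char *touched = calloc(ydim, 1);
    size_t covered = 0;
    if (touched == NULL)
        return 0;
    for (size_t group = 0; group < ydim / local_size; group++) {
        for (size_t lid = 0; lid < local_size; lid++) {
            size_t i = use_local_id
                           ? lid
                           : model_global_id(group, local_size, lid);
            if (!touched[i]) {
                touched[i] = 1;
                covered++;
            }
        }
    }
    free(touched);
    return covered;
}
```

In this model, count_covered(1024, 256, 0) reaches all 1024 elements, whereas count_covered(1024, 256, 1) only ever touches the first 256, because every work group repeats the same local IDs. When the whole dimension fits in a single work group the two IDs coincide, which would be consistent with results looking correct for small sizes.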

See also Questions about global and local work size

If you don't use shared local memory, you don't need to worry about local work group sizes. In fact, you can pass NULL instead of a pointer to an array of sizes for local_work_size and let the runtime pick something (it helps if your global dimensions are easily divisible by small numbers).

Dithermaster