
I am developing OpenCL code for different devices. At the moment I am working with a Rockchip RK3588 (OpenCL device: Mali-G610 r0p0). The program's algorithm was originally written in CUDA, where the warp size is 32. In OpenCL this concept is called the sub-group size (the number of work-items executing at the same time). The value can also be obtained by querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
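
I read that value with clGetKernelWorkGroupInfo, roughly like this (a sketch; kernel and device are my existing handles):

size_t preferredMultiple = 0;
clGetKernelWorkGroupInfo(
    kernel,
    device,
    CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
    sizeof(preferredMultiple),
    &preferredMultiple,
    nullptr); //param_value_size_ret not needed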

For example, on Intel GPUs I can set this value using __attribute__((intel_reqd_sub_group_size(32))). On the Mali-G610 r0p0 I get CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 16, but the program does not work correctly; I need to change this value to 32.
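
On Intel the kernel-side attribute looks roughly like this (my_kernel and its argument are placeholder names; the cl_intel_required_subgroup_size extension is assumed):

__attribute__((intel_reqd_sub_group_size(32)))
__kernel void my_kernel(__global float* data)
{
    //kernel body written assuming 32 work-items per sub-group
}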

clinfo returned the following info:
................
Preferred work group size multiple (device) 16
Preferred work group size multiple (kernel) 16
Max sub-groups per work group 64
................

Perhaps someone can help me with this?

  • You can't change it. It's a property of the hardware design - warps are 16 wide on Mali-G610. If your kernel is written assuming the larger warp size as a means to implement non-portable optimizations, such as skipping synchronization, then you will need to re-add the necessary barriers to synchronize multiple warps. – solidpixel Mar 30 '22 at 23:14
  • Thanks for the answer, I also guessed that this value cannot be changed programmatically. The algorithm is tuned for a warp size of 32 (from CUDA). On AMD the wavefront size is 64, which covers the algorithm. On Intel GPUs sub-group sizes can be 8, 16, or 32, and there is a way to request the desired value of 32 through a special attribute. But on the Mali platform I encountered a warp size of 16, so it turns out I need to rewrite the code, especially the synchronization code (barrier/work_group_barrier calls); a sketch of the issue follows these comments. I had hope because of the value "Max sub-groups per work group 64". – Vitaliy Shemet Apr 01 '22 at 12:23
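
To illustrate the synchronization issue: a minimal sketch of a reduction tail, where scratch stands for a __local buffer and lid for get_local_id(0) (both hypothetical names). A CUDA original can skip __syncthreads() for the last 32 lanes because a 32-wide warp runs them in lockstep; on 16-wide hardware every step needs an explicit barrier:

for (unsigned int s = 32; s > 0; s >>= 1) {
    if (lid < s)
        scratch[lid] += scratch[lid + s];
    barrier(CLK_LOCAL_MEM_FENCE); //cannot be skipped once lockstep width < 32
}

Note that barrier() must be reached by every work-item in the work group, which is why it sits outside the if.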

1 Answer


You're interpreting the value incorrectly. The "Preferred Work Group Size Multiple" value is just that: a preference. The OpenCL API is telling you that it prefers work-group sizes to come in multiples of 16, but you can specify other sizes if you absolutely need to. Luckily for you, you need a size of 32, which it is perfectly happy with.

What you actually need to do is specify the work-group size manually when you enqueue the kernel to the GPU. If you don't specify it, the API will figure out for itself what the size should be, and while its guesses are usually pretty good, if you're doing work at the level of the work group itself (very common in reduction algorithms) you may have to specify it manually.

When you're submitting your kernel, you should be using code that resembles this:

size_t globalWorkSize[] {512};
cl_event event;

clEnqueueNDRangeKernel(
    command_queue,
    kernel,
    1, //work_dim
    nullptr, //global_work_offset
    globalWorkSize,
    nullptr, //local_work_size
    0, //num_events_in_wait_list
    nullptr, //event_wait_list
    &event
);

You need to add a parameter that explicitly specifies the work group size:

size_t globalWorkSize[] {512};
size_t localWorkSize[] {32}; //Will create 16 work groups
cl_event event;

clEnqueueNDRangeKernel(
    command_queue,
    kernel,
    1, //work_dim
    nullptr, //global_work_offset
    globalWorkSize,
    localWorkSize,
    0, //num_events_in_wait_list
    nullptr, //event_wait_list
    &event
);
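
One caveat: the explicit local size must evenly divide the global size in each dimension (for uniform work groups), and the total must not exceed the limit the kernel reports via CL_KERNEL_WORK_GROUP_SIZE; otherwise the enqueue fails. A quick check might look like this (sketch):

cl_int err = clEnqueueNDRangeKernel(command_queue, kernel, 1, nullptr,
                                    globalWorkSize, localWorkSize,
                                    0, nullptr, &event);
if (err == CL_INVALID_WORK_GROUP_SIZE) {
    //32 either doesn't divide the global size or exceeds the
    //limit reported for this kernel/device pair
}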
– Xirema
  • Thanks for the answer, but I meant something a little different. The algorithm was originally written in CUDA (warp size 32). This means there are many places where this number is used to work with local memory, and it was assumed that the synchronization functions synchronize 32 threads or work-items. On AMD the wavefront size is 64, which covers the algorithm. On Intel GPUs sub-group sizes can be 8, 16, or 32, and there is a way to set the desired value of 32 through a special attribute. – Vitaliy Shemet Apr 01 '22 at 12:45
  • But on the Mali platform I encountered a warp size of 16, and I would like not to rewrite the code but instead change this value to 32, so the kernel executes as it does on Intel. Since "Max sub-groups per work group" is 64, I thought maybe there was such a way. – Vitaliy Shemet Apr 01 '22 at 12:45
  • @VitaliyShemet You may want to look into the [`clGetKernelSubGroupInfo`](https://man.opencl.org/clGetKernelSubGroupInfo.html) API function, since that directly deals with subgroups and is more likely to give you information you need to setup your code correctly. – Xirema Apr 01 '22 at 15:11
  • Thanks, I tried this functionality and got the following values: `CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE = 16 CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE = 2 CL_KERNEL_MAX_NUM_SUB_GROUPS = 64 CL_KERNEL_LOCAL_SIZE_FOR_SUB_GROUP_COUNT = 1024 1 1`. Since `CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE` is always 16, this means the code needs to be rewritten. The query is sketched below. – Vitaliy Shemet Apr 04 '22 at 11:07
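
For reference, the clGetKernelSubGroupInfo query used above looks roughly like this (a sketch; kernel and device are existing handles, OpenCL 2.1+ headers assumed):

size_t localSize[] {32}; //the local work size we would like to use
size_t maxSubGroupSize = 0;
clGetKernelSubGroupInfo(
    kernel,
    device,
    CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE,
    sizeof(localSize), //input_value_size
    localSize,         //input_value: the proposed local work size
    sizeof(maxSubGroupSize),
    &maxSubGroupSize,
    nullptr);
//On Mali-G610 this reports 16 regardless of the local size passed in,
//so the 32-wide assumption has to be removed from the kernel code.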