
I am benchmarking a simple matrix transposition kernel on a Qualcomm Adreno 630 GPU, trying to see the impact of different work group sizes, but surprisingly I get some interesting results which I cannot explain. Here is my kernel code:

__kernel void transpose(__global float *input, __global float *output, const int width, const int height) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*height + j] = input[j*width + i];
}

The width and height are both 6400. The experiment results are as follows (execution time is the difference between the END and START profiling events):

work group size        execution time
   x      y
   4     64                 24ms
  64      4                169ms
 256      1                654ms
   1    256                 34ms
   8     32                 27ms
   1   1024                375ms
1024      1                657ms
  32     32                 26ms

After this I did another experiment where I changed the width and height from 6400 to 6401 (and the global work size in the NDRangeKernel call as well), and the results are even more interesting:

work group size        execution time
   x      y
   4     64                 28ms
  64      4                105ms
 256      1                359ms
   1    256                 31ms
   8     32                 32ms
   1   1024                 99ms
1024      1                358ms
  32     32                 32ms

The execution time of most scenarios drops significantly. I know memory coalescing or caching could play a role here, but I cannot completely explain this.
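For reference, the launch and timing follow the standard OpenCL event-profiling pattern, roughly like this (simplified sketch, error handling omitted; `queue` and `kernel` are assumed to be set up already, with the queue created with CL_QUEUE_PROFILING_ENABLE):

// enqueue the kernel with the work group size under test and read the event timestamps
size_t global[2] = {6400, 6400}; // matches width/height
size_t local[2]  = {8, 32};      // work group size (x, y)

cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start, end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
double ms = (double)(end - start) * 1e-6; // timestamps are in nanoseconds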

victor

1 Answer


Memory coalescing occurs when consecutive threads access data at consecutive global memory addresses within a 128-byte aligned segment. These accesses are then coalesced into a single memory transaction, significantly reducing overall latency.
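To make the two patterns concrete, here is a minimal sketch (hypothetical copy kernels, not from the question) contrasting a coalesced and a strided access:

// coalesced: consecutive work-items access consecutive addresses
__kernel void copy_coalesced(__global const float *in, __global float *out) {
    const int n = get_global_id(0);
    out[n] = in[n]; // work-items n, n+1, ... hit the same 128-byte segment
}

// strided: consecutive work-items read addresses `width` floats apart
__kernel void copy_strided(__global const float *in, __global float *out, const int width) {
    const int n = get_global_id(0);
    out[n] = in[(n%width)*width + n/width]; // each read lands in a different segment
}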

In the 2D range, coalescing only happens along get_global_id(1), the j direction in your case. In the line output[i*height + j] = input[j*width + i];, input[j*width + i] is a misaligned (non-coalesced) read and output[i*height + j] is a coalesced write. Coalesced memory access generally is much faster than misaligned access, but the performance penalty for misaligned reads can be vastly different from that for misaligned writes. On most desktop GPU architectures, the combination of misaligned read and coalesced write is faster than the other way around; see the diagram below. So your implementation should already be the faster variant.

[Figure: coalesced/misaligned memory bandwidth for various devices]

Since coalesced access is only possible along the j index, if you have a range of (x=256, y=1) (i along the x-direction, j along the y-direction), you do not get any coalescing. For (x=8, y=32), j is coalesced in groups of 32, 8 times per thread block, so memory bandwidth is fairly well saturated and performance is good; the address calculation below spells this out.
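As a quick sanity check of the (x=8, y=32) case, the write addresses work out like this (a sketch of the arithmetic, using the 6400-wide matrix from the question):

// one (x=8, y=32) work group, writing output[i*height + j]:
// the 32 work-items with fixed i=i0 and j = j0..j0+31 write to
//   output[i0*6400 + j0] ... output[i0*6400 + j0 + 31]
// -> 32 consecutive floats = 128 bytes = one coalesced transaction
// the 8 distinct i values in the group produce 8 such transactions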

If you want maximum possible performance, I'd suggest you go with 1D indexing. This way you have full control over coalescing, and coalescing happens over the entire thread block. Your matrix transpose kernel would then look like this:

#define width 6400
#define height 6400
__kernel void transpose(__global float *input, __global float *output) {
    const int n = get_global_id(0);
    const int i = n/width;  // row index of the output element
    const int j = n%width;  // column index of the output element
    output[i*height + j] = input[j*width + i];
}

You can bake width and height into the OpenCL C code at C++ runtime, before OpenCL compile time, via string concatenation.
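A minimal sketch of that concatenation on the host side (assuming a C++ host; the helper name build_source is made up):

#include <string>

// assemble the OpenCL C source at host runtime so that width/height
// become compile-time constants for the OpenCL compiler
std::string build_source(const int width, const int height) {
    std::string src;
    src += "#define width "  + std::to_string(width)  + "\n";
    src += "#define height " + std::to_string(height) + "\n";
    src += "__kernel void transpose(__global float *input, __global float *output) {\n"
           "    const int n = get_global_id(0);\n"
           "    const int i = n/width;\n"
           "    const int j = n%width;\n"
           "    output[i*height + j] = input[j*width + i];\n"
           "}\n";
    return src;
}

The resulting string then goes to clCreateProgramWithSource as usual, and the kernel is launched with a 1D global size of width*height.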

ProjectPhysX
  • Thanks for your answer, but I am still a bit confused; here are my points: 1. Why does coalescing only happen along the j direction? It is a bit counter-intuitive, because row-major (where consecutive threads lie in the same row) is more natural; do you have a source for this info? I did another experiment with a simple copy kernel (output[j * width + i] = input[j * width + i];): 256x1 (x * y) gives 13ms but 1x256 gives 41ms. 2. That an unaligned write has a bigger penalty than an unaligned read is also counter-intuitive, because a read normally takes more bus time. 3. How can you explain the performance with the 6401 array size? – victor Aug 10 '20 at 08:15
  • 1. Memory is stored in 1D: `n = get_global_id(0)*get_global_size(1) + get_global_id(1)`. For consecutive `id(1)`, the memory address `n` is consecutive, but not for consecutive `id(0)`. Defining rows/columns is up to you. For `256x1`, consecutive `i` have consecutive memory addresses, so you get coalescing in reads & writes. I can't explain why `1x256` is slower though; you should also get coalescing there. 2. Can't explain it, it's an observation. 3. 6401 isn't divisible by your thread block size when `y!=1`; for the remaining items of the last block you read from/write to nirvana without an extra `if` in the kernel. – ProjectPhysX Aug 10 '20 at 11:06
  • I didn't get your point in item 3: why would this cause a 2x performance gain (execution time drops by 50% for almost all work group sizes)? – victor Aug 13 '20 at 13:13
  • For `6400x6401` and block size `32x32` you have 200 blocks in the `x` and 201 blocks in the `y` direction. But the 201st block in `y` only contains 1 valid data element and 31 that are out of bounds. The compiler will not complain about this, but you are working with 31 threads in unmapped memory, and this can cause lots of issues, including parts of the code not running at all, which results in faster execution time, but the results are random garbage. Try with `if(get_global_id(1)>6400) return;` at the beginning of the kernel (see the guarded-kernel sketch below); then execution time should be normal or a bit slower, but with correct results. – ProjectPhysX Aug 13 '20 at 13:26
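For completeness, a sketch of the guarded 2D kernel along the lines of the last comment (the generalized bounds check on both indices is my addition):

__kernel void transpose(__global float *input, __global float *output, const int width, const int height) {
    const int i = get_global_id(0);
    const int j = get_global_id(1);
    if(i >= width || j >= height) return; // skip the padded out-of-range work-items
    output[i*height + j] = input[j*width + i];
}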