This seems to be an old post, but let me update this post with some new information. Hopefully, it could help someone else.
Does Global Work Size (Dimensions) Need to be Multiple of Work Group
Size (Dimensions) in OpenCL?
Answer: True till OpenCL 2.0. Before CL2.0, your global work size must be a multiple of local work size, otherwise you will get an error message when you execute clEnqueueNDRangeKernel.
But from CL2.0, this is not required anymore. You can use whatever global work size which fits your application dimensions. However, please remember that the hardware implementation might still use the "old" way, which means padding the global work group size. Therefore, it makes the performance highly dependent on the hardware architecture. You may see quite different performance on different hardware/platforms. Plus, you want to make your application back compatible to support older platform which only supports CL up to version 1.2. So, I think this new feature added in CL2.0 is just for easy programming, to get better controllable performance and backward compatibility, I suggest you still use the following method mentioned by you:
Increase the dimensions of the global work to be the nearest multiple
of the work group dimensions, keeping all input and output buffers the
same but checking bounds in the kernel to avoid segfaulting, i.e. do
nothing on the work items out of bound of the desired output. (This
seems like the better way.)
Answer: you are absolutely right. This is the right way to handle such case. Carefully design the local work group size (considering factors such as register usage, cache hit/miss, memory access pattern and so on). And then pad your global work size to a multiple of local work size. Then, you are good to go.
Another thing to consider is that you can utilize the image object to store the data instead of buffer, if there are quite a lot of boundary checking work in your kernel. For image, the boundary check is automatically done by hardware, almost no overhead in most of the implementations. Therefore, padding your global work size, store your data in image object, then, you just need to write your code normally without worrying about the boundary checking.