
I have an input image (say, a buffer of 1024 * 1024 pixels with RGBA color data).

For each pixel, I want to filter it based on its neighbors, e.g. over a [-15, 15] window in the x and y directions.

My concern is that doing this through global memory would require 31 * 31 global memory accesses for each pixel, which would be a serious performance bottleneck. I'm also not sure about the behavior of multiple threads trying to read from the same memory location at the same time (maybe some of them fail to read, so garbage data in -> garbage data out).

This question applies to CUDA or OpenCL, as the concept should be the same. I know that shared memory (per work group) or local memory (per thread) won't solve this on its own, as I can't read another thread's local memory or another group's shared memory (correct me if I misunderstand this concept).
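For concreteness, the naive global-memory version I'm worried about would look roughly like this (a sketch only; a plain 31x31 box average stands in for the real filter, and the kernel name is made up):

```cuda
#include <cuda_runtime.h>

// Naive version: every thread issues 31*31 = 961 global-memory reads.
__global__ void naiveFilter(const uchar4 *in, uchar4 *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float4 sum = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int dy = -15; dy <= 15; dy++)
        for (int dx = -15; dx <= 15; dx++) {
            int nx = min(max(x + dx, 0), w - 1);   // clamp at image borders
            int ny = min(max(y + dy, 0), h - 1);
            uchar4 p = in[ny * w + nx];            // one global read per tap
            sum.x += p.x; sum.y += p.y; sum.z += p.z; sum.w += p.w;
        }
    const float n = 31.0f * 31.0f;
    out[y * w + x] = make_uchar4((unsigned char)(sum.x / n),
                                 (unsigned char)(sum.y / n),
                                 (unsigned char)(sum.z / n),
                                 (unsigned char)(sum.w / n));
}
```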

Mohamed Sakr

1 Answer


Shared memory is a typical approach to this problem, although the stencil area (31*31) is quite large. The benefit of data re-use can still be gained, however. Since each adjacent pixel's computation extends the required region by only one column, a 16KB shared memory array of 32-bit RGBA pixels can hold enough data for at least 64 threads to cooperatively compute their pixel values out of a single shared memory load.

Regarding the concern about multiple threads reading the same location, there is no possibility for garbage data reads. Certainly there is a possibility for contention leading to a performance impact, but in fact with an orderly for-loop progression in the kernel, no threads will be reading the same location at the same time anyway. With appropriate data organization there will be good opportunity for coalesced reads from global memory and no bank conflicts in shared memory.

This type of problem is well-suited to GPUs, e.g. via CUDA or OpenCL, and there are many examples of programs like this on SO.
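Here is a minimal sketch of that scheme, assuming a 64-thread block computing 64 adjacent pixels in one image row, with a simple box average standing in for the actual filter (kernel name and launch shape are illustrative):

```cuda
#include <cuda_runtime.h>

#define R       15            // filter radius: 31x31 window
#define TPB     64            // threads per block, one output pixel each
#define TILE_W  (TPB + 2*R)   // 94 input columns per block
#define TILE_H  (2*R + 1)     // 31 input rows per block

// Each block loads its 31 x 94 input tile into shared memory once,
// then all 64 threads re-use it to compute 64 adjacent output pixels.
__global__ void boxFilter64(const uchar4 *in, uchar4 *out, int w, int h)
{
    __shared__ uchar4 tile[TILE_H][TILE_W];   // ~11.4 KB

    int outX = blockIdx.x * TPB + threadIdx.x;   // this thread's output column
    int outY = blockIdx.y;                       // one image row per blockIdx.y

    // Cooperative load: the 64 threads stride across all tile elements.
    int tileX0 = blockIdx.x * TPB - R;           // tile's left edge in image coords
    int tileY0 = outY - R;                       // tile's top edge
    for (int i = threadIdx.x; i < TILE_H * TILE_W; i += TPB) {
        int ty = i / TILE_W, tx = i % TILE_W;
        int gx = min(max(tileX0 + tx, 0), w - 1);  // clamp at image borders
        int gy = min(max(tileY0 + ty, 0), h - 1);
        tile[ty][tx] = in[gy * w + gx];
    }
    __syncthreads();

    if (outX >= w) return;

    // Each thread averages its own 31x31 window out of shared memory.
    float4 sum = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int dy = 0; dy < TILE_H; dy++)
        for (int dx = 0; dx < 2*R + 1; dx++) {
            uchar4 p = tile[dy][threadIdx.x + dx];
            sum.x += p.x; sum.y += p.y; sum.z += p.z; sum.w += p.w;
        }
    const float n = (2*R + 1) * (2*R + 1);
    out[outY * w + outX] = make_uchar4((unsigned char)(sum.x / n),
                                       (unsigned char)(sum.y / n),
                                       (unsigned char)(sum.z / n),
                                       (unsigned char)(sum.w / n));
}
```

Launched as e.g. `boxFilter64<<<dim3((w + TPB - 1) / TPB, h), TPB>>>(d_in, d_out, w, h);`, each block performs one tile load from global memory and all 64 threads share it, rather than each thread issuing 961 independent global reads.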

Robert Crovella
  • Let the groups' threads be arranged like [0~31] [0~31] ... How can thread 20 in the first group access the shared memory of thread 20 in the second group? The window is around each thread, so I have to access 15 threads before and 15 threads after. – Mohamed Sakr Apr 10 '14 at 02:34
  • Shared memory is not associated with an individual thread or thread group. It is associated with a threadblock, i.e. all the threads in a threadblock. Any thread in a threadblock can access the shared memory associated with that threadblock. – Robert Crovella Apr 10 '14 at 02:36
  • I know this!! All 32*32 threads inside the block have access to that block's shared memory, but the problem here, for example: the thread for pixel (420,390) wants to access pixels in the range (420-15,390-15) to (420+15,390+15), so if we copy global memory into shared memory, only the center thread in the block will have its data ready, and all the other threads will need to access neighboring blocks' shared memory "which is impossible" – Mohamed Sakr Apr 10 '14 at 02:42
  • Yes, so you don't just load 32*32 data elements into shared memory. You load 32*32 data elements plus one additional column (32 elements) for each additional thread. The remainder of the data required by that additional thread is the same data you loaded for the previous thread. And I'm not suggesting a block of 32*32 threads. I'm suggesting a block of maybe 64 threads, total. For 64 threads, you would need 32*32 pixels, plus 32*63 pixels, approximately, in shared memory. That is certainly doable, and will provide enough data for all 64 threads to compute the values of 64 adjacent pixels – Robert Crovella Apr 10 '14 at 02:45
  • Well, your last answer confused me more "I can't figure out what it will look like"; maybe 5 lines of code would let me understand: kernel<<<...>>>(pixels), how you would copy from global to shared, the for-loop access pattern – Mohamed Sakr Apr 10 '14 at 02:59
  • Here's a [fully worked example code](http://stackoverflow.com/questions/14920931/3d-cuda-kernel-indexing-for-image-filtering/14926201#14926201). It happens to be in 3D instead of 2D, but the concepts are identical. – Robert Crovella Apr 10 '14 at 03:07
  • 1. If you use Images, the texture cache may help, and it is much simpler to code for than shared local memory. 2. If your convolution is separable (like a Gaussian), you can decompose it into two 1D passes, which is far fewer memory accesses overall and therefore faster. – Dithermaster Apr 19 '14 at 16:42
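To illustrate the second point in the comment above: if the 31x31 filter is separable, it factors into a horizontal and a vertical 1D pass of 31 taps each, so roughly 62 reads per pixel instead of 961. Below is a sketch of the horizontal pass only; the kernel name, the `coef` coefficient array, and the float4 `tmp` intermediate buffer are illustrative assumptions, and a matching vertical pass over `tmp` would complete the filter:

```cuda
#include <cuda_runtime.h>

// Horizontal pass of a separable 31-tap filter; a second, vertical
// pass over tmp produces the final image.
__global__ void rowPass(const uchar4 *in, float4 *tmp, const float *coef,
                        int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float4 sum = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int d = -15; d <= 15; d++) {
        int nx = min(max(x + d, 0), w - 1);   // clamp at image borders
        uchar4 p = in[y * w + nx];
        float c = coef[d + 15];               // 31 1D filter coefficients
        sum.x += c * p.x; sum.y += c * p.y;
        sum.z += c * p.z; sum.w += c * p.w;
    }
    tmp[y * w + x] = sum;
}
```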