
I have trouble understanding this concept. I've researched a lot online, and the only thing I've understood is that threads need to access consecutive data.

So if we have an array of 10,000 integers and thread i accesses the i-th element of the array, then the memory accesses will be coalesced.

What if, instead of having 10,000 threads, one for each integer, we have 5,000 threads where each thread accesses two consecutive integers? Will the accesses still be coalesced in this case?

And what if each thread accesses even more values, say 10?

How would memory coalescing behave in these cases? And at what point does "consecutive access" stop being "consecutive" in the examples I described above?
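
To make the cases concrete, here is a minimal sketch of the access patterns I have in mind (the kernel names and the doubling operation are placeholders of my own, not from any real code):

```cuda
// Case 1: one element per thread. A warp's 32 threads read 32 consecutive
// ints (128 contiguous bytes), which I understand to be the coalesced case.
__global__ void onePerThread(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2;
}

// Case 2: two consecutive elements per thread. For each individual load
// instruction, neighbouring threads are now 8 bytes apart instead of 4,
// so one warp-wide load spans 256 bytes rather than 128.
__global__ void twoPerThread(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i + 1 < n) {
        out[2 * i]     = in[2 * i]     * 2;
        out[2 * i + 1] = in[2 * i + 1] * 2;
    }
}

// Alternative I've seen: each thread still handles several elements, but
// they are spaced by the total thread count, so within every single load
// instruction neighbouring threads read neighbouring ints.
__global__ void gridStride(const int *in, int *out, int n)
{
    int stride = gridDim.x * blockDim.x;  // total number of threads
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i] * 2;
}
```

Is the second kernel still considered coalesced, or do only the first and the third qualify?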

Thank you in advance

ksm001
  • Great answer already here: http://stackoverflow.com/questions/5041328/cuda-coalesced-memory - It's not that the thread indexes must be the same as the memory indexes; consecutive threads just need to load memory that is right next to each other in the address space. – xjedam Jun 24 '13 at 21:35
  • Thank you for your comment, but what does "right next to each other" mean in terms of actual bytes in global memory? For example, if we have 20 elements in the array and create 5 threads, each working with 4 elements, will the memory each thread loads be right next to each other? In that case a thread loads 4 elements, i.e. 4*4 = 16 bytes, so the first element of thread `i` is 16 bytes away from the first element of thread `i+1`, and the same distance applies to the other three elements (see the sketch after these comments). – ksm001 Jun 24 '13 at 21:47
  • 1
  • Perhaps a [webinar](https://developer.nvidia.com/gpu-computing-webinars) may be of interest. There are various webinars that give a good treatment of coalescing with lots of examples, such as "GPU Computing using CUDA C – Advanced 1 (2010)" or "CUDA Global Memory Usage & Strategy + Live Q&A with Dr Justin Luitjens, NVIDIA". It would be an hour well spent if you want to understand the topic well. SO isn't really well designed for tutorials or sequences of questions and follow-up questions. – Robert Crovella Jun 24 '13 at 21:53
  • 1
  • @ksm001 Have you read the [Compute Capabilities](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) section of the CUDA C Programming Guide? The memory subsystems of CC 1.1, 1.2-1.3, 2.*, and 3.* vary quite a bit, and the logic is different in each case. The aforementioned link is heavily targeted at CC 1.* architectures. NOTE: the Nsight VSE Memory Transactions experiment will show a histogram of # of transactions per source line to help identify bad access patterns. First, you have to understand how memory instructions are converted to transactions. – Greg Smith Jun 24 '13 at 21:54
  • 1
  • My answer [here](http://stackoverflow.com/questions/13834651/cuda-compute-capability-2-0-global-memory-access-pattern/13835373#13835373) may also be of interest. – Robert Crovella Jun 24 '13 at 21:56
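
To make the 20-elements/5-threads layout from the comment above concrete: read one int at a time, thread `i`'s accesses start 16 bytes after thread `i-1`'s, so each warp-wide load instruction touches addresses spaced 16 bytes apart. A common way to keep such a layout coalesced is to load the four consecutive ints as a single `int4`; the sketch below assumes that idiom (the kernel name is illustrative, not from the discussion above):

```cuda
__global__ void fourPerThreadVectorized(const int4 *in, int4 *out, int nVec)
{
    // One int4 = four consecutive ints = 16 bytes, fetched by a single
    // 16-byte load. Thread i reads bytes [16*i, 16*i + 16), so neighbouring
    // threads read back-to-back chunks and the warp's accesses stay
    // contiguous. Pointers from cudaMalloc are sufficiently aligned for int4.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nVec) {
        int4 v = in[i];
        v.x *= 2; v.y *= 2; v.z *= 2; v.w *= 2;
        out[i] = v;
    }
}
```

For the 20-element example this would be launched with nVec = 5: one int4, i.e. four consecutive ints, per thread.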

1 Answer


> I have trouble understanding this concept

It's not something that can be thoroughly covered in a short description, especially with all the clarification questions that are likely to occur to you.

My suggestion is to watch one of these webinars:

- GPU Computing using CUDA C – Advanced 1 (2010)
- CUDA Global Memory Usage & Strategy + Live Q&A with Dr Justin Luitjens, NVIDIA

Then come back when you have specific questions that are based on a general understanding of the topic.

Robert Crovella