5

I have written an application in CUDA which uses 1 KB of shared memory in each block.
Since there is only 16 KB of shared memory in each SM, only 16 blocks can be accommodated overall, right? Only 8 of those can be scheduled at any one time, but if one of the scheduled blocks is busy doing memory operations, another block will be scheduled on the GPU, yet all of the shared memory is already in use by the 16 blocks that have been scheduled there.

So will CUDA not schedule more blocks on the same SM until the previously allocated blocks have completely finished?

Or will it move some block's shared memory out to global memory and allocate another block there? In that case, should we worry about global memory access latency?
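
For concreteness, here is a minimal sketch of the kind of kernel I mean (the names and the 256-thread block size are just illustrative); each block statically allocates 1 KB of shared memory:

```cuda
// Illustrative only: each block statically allocates 256 floats = 1 KB of
// shared memory, assuming the kernel is launched with 256 threads per block.
__global__ void scale_kernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];               // 256 * 4 bytes = 1 KB per block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        tile[threadIdx.x] = in[idx];          // stage data in shared memory
    __syncthreads();

    if (idx < n)
        out[idx] = 2.0f * tile[threadIdx.x];  // trivial use of the staged data
}
```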

peeyush

1 Answer

7

It does not work like that. The number of blocks which will be scheduled to run at any given moment on a single SM will always be the minimum of the following:

  1. 8 blocks
  2. The number of blocks whose combined static plus dynamically allocated shared memory fits into the 16 KB or 48 KB available, depending on GPU architecture and settings. There are also shared memory page size limitations, which mean per-block allocations get rounded up to the next largest multiple of the page size.
  3. The number of blocks whose combined per-block register usage fits into the 8192/16384/32768 registers available, depending on architecture. There are also register file page sizes, which mean per-block allocations get rounded up to the next largest multiple of the page size. A worked sketch of how these limits combine is shown below the list.
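
As a rough illustration of how these rules combine for the 1 KB-per-block case in the question, here is a hedged sketch; the 8-block/16 KB/16384-register figures and the assumed 2048 registers per block are illustrative numbers, not values read from any particular device, and page-size rounding is ignored:

```cuda
#include <cstdio>
#include <algorithm>

int main()
{
    // Illustrative hardware limits (architecture dependent).
    const int maxBlocksPerSM = 8;            // scheduling limit (rule 1)
    const int smemPerSM      = 16 * 1024;    // 16 KB shared memory per SM (rule 2)
    const int regsPerSM      = 16384;        // register file size (rule 3)

    // Illustrative per-block usage: 1 KB shared memory as in the question,
    // and an assumed 256 threads * 8 registers = 2048 registers per block.
    const int smemPerBlock   = 1 * 1024;
    const int regsPerBlock   = 2048;

    int limitBySmem = smemPerSM / smemPerBlock;   // 16 blocks fit by shared memory
    int limitByRegs = regsPerSM / regsPerBlock;   // 8 blocks fit by registers
    int resident    = std::min(maxBlocksPerSM,
                               std::min(limitBySmem, limitByRegs));

    printf("resident blocks per SM: %d\n", resident);  // 8 in this example
    return 0;
}
```

With these numbers the shared memory would allow 16 blocks, but the 8-block scheduling limit caps residency at 8, so the remaining shared memory simply goes unused.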

That is all there is to it. There is no "paging" of shared memory to accommodate more blocks. NVIDIA produces an occupancy calculator spreadsheet which ships with the toolkit and is also available as a separate download. You can see the exact rules in the formulas it contains. They are also discussed in section 4.2 of the CUDA programming guide.
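
As an aside, the CUDA runtime later gained an occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor, added well after this answer was written) that performs the same calculation programmatically. A hedged sketch, assuming a kernel with 1 KB of static shared memory launched with 256 threads per block:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Never launched here; it only exists so the occupancy API can inspect its
// resource usage (1 KB of static shared memory, 256 threads per block assumed).
__global__ void myKernel(const float *in, float *out)
{
    __shared__ float tile[256];               // 1 KB per block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];
    __syncthreads();
    out[idx] = tile[threadIdx.x];
}

int main()
{
    int blocksPerSM = 0;
    // 256 threads per block, no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```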

talonmies
  • So does this mean that sometimes it is better not to use shared memory, since more blocks will then run in parallel? – scatman Apr 11 '11 at 09:05
  • It really depends. Shared memory is a lot slower than registers, and registers have no bank conflicts, so it is always better to use registers over shared memory if possible. The traditional use for shared memory was to allow data re-use between threads within a block, and in pre-Fermi times it was very effective for that. On Fermi the case for shared memory can be a bit less compelling. The L1 and L2 caches mean that you can often get a good fraction of what shared memory might yield without doing anything, and there are no bank conflicts or serialization effects to worry about. (A sketch of this re-use pattern follows these comments.) – talonmies Apr 11 '11 at 09:29
  • So if some blocks get scheduled on one SM and at some instant all of their warps are waiting for memory operations to complete, will CUDA schedule another block on the same SM (and what happens to the shared memory data of the already allocated blocks?), or will it wait until the allocated blocks have finished their operations? – peeyush Apr 11 '11 at 12:13
  • The hardware will always schedule as many blocks as can run, and then no more until such time as resources become available so that more can be scheduled. If every active warp on an SM were waiting for memory transactions or at a synchronization barrier, the SM would be stalled. Exactly how the scheduling heuristics work is not officially documented, but the consensus seems to be that on pre-Fermi cards no new blocks would be scheduled until every block on an SM was finished, while on Fermi it is more flexible than that. – talonmies Apr 11 '11 at 12:35
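
To illustrate the data re-use point from the comments, here is a hedged, purely illustrative sketch of the classic pattern: each block stages a tile of input into shared memory once, and every thread then reads its neighbours from the tile rather than from global memory (a 3-point average; the names and the 256-thread block size are assumptions):

```cuda
#define BLOCK 256

// Each input element is loaded from global memory once, then read by up to
// three threads from shared memory. On Fermi and later, the L1/L2 caches may
// capture much of this re-use automatically, as noted in the comments above.
__global__ void avg3(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];                  // block tile plus halo

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)                              // left halo element
        tile[0] = (gid > 0 && gid <= n) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)                 // right halo element
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```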