I have written an application in CUDA that uses 1 KB of shared memory in each block.
Since there is only 16 KB of shared memory per SM, only 16 such blocks can fit on one SM at a time, right? (I know that at most 8 blocks can be resident on an SM at a time anyway.) Now, if a resident block is stalled on memory operations, the SM is supposed to switch to another block to hide the latency, but all of the shared memory is already taken up by the blocks that have been scheduled there.
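For concreteness, here is a minimal sketch of the kind of kernel I mean (the name and logic are placeholders, not my actual code; the only point is the 1 KB static shared-memory allocation per block):

```cuda
// Hypothetical kernel, launched with 256 threads per block: each block
// statically allocates 1 KB of shared memory (256 floats x 4 bytes).
__global__ void scaleWithSharedBuffer(const float *in, float *out, int n)
{
    __shared__ float buf[256];                      // 256 * 4 B = 1 KB per block

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    buf[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;  // stage data through shared memory
    __syncthreads();                                // all threads reach this together

    if (tid < n)
        out[tid] = buf[threadIdx.x] * 2.0f;
}
```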
So will CUDA simply not schedule more blocks on that SM until the previously allocated blocks have completely finished?
Or will it move some block's shared memory out to global memory and allocate another block in its place? If so, should we worry about global memory access latency?
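In case it helps frame the question, this is how I would try to check the limits myself. I'm assuming cudaGetDeviceProperties and, on newer toolkits, cudaOccupancyMaxActiveBlocksPerMultiprocessor are the right calls; the dummy kernel and the 256-thread block size are just placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel with the same 1 KB static shared-memory footprint per block.
__global__ void dummyOneKbShared(float *out)
{
    __shared__ float buf[256];          // 1 KB per block
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);

    // Newer toolkits can report how many blocks of this kernel actually fit
    // on one SM for a given block size and dynamic shared memory (0 here).
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyOneKbShared,
                                                  256 /* threads per block */,
                                                  0   /* dynamic smem bytes */);
    printf("Resident blocks per SM for this kernel: %d\n", blocksPerSM);
    return 0;
}
```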