I'm running into a resource issue with concurrent kernel execution on my K20. My streams only overlap a little, and I suspect this might be because of a resource limitation. So I consulted the manual and found this: the maximum number of resident blocks per multiprocessor is 16, and the maximum number of resident threads per multiprocessor is 2048.
So my question is: if I launch a kernel with 96 blocks of 1024 threads each, how many SMs will this kernel use in parallel?
Answer 1: 96/16 = 6
Answer 2: (1024/2048) * 96 = 48 (but the K20 only has 13 SMs, so how would this kernel behave?)
Or maybe you have another answer?
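
To make my reasoning concrete, here is a minimal host-side sketch (plain C++, no CUDA calls) of how I understand the two limits to combine. The struct and function names are hypothetical, made up just for this post. It assumes registers and shared memory per block are not the limiting factor, which I have not checked for my kernel, and it only gives an upper bound on residency; it says nothing about how the hardware scheduler actually distributes blocks.

```cpp
// Per-SM residency limits as quoted from the manual, plus the K20's SM count.
struct SmLimits {
    int max_blocks_per_sm;   // 16 resident blocks per SM
    int max_threads_per_sm;  // 2048 resident threads per SM
    int sm_count;            // 13 SMs on the K20
};

// Hypothetical constant holding the numbers from this question.
const SmLimits kK20 = {16, 2048, 13};

// Resident blocks per SM: the tighter of the block limit and the thread limit.
// ASSUMPTION: registers and shared memory do not reduce this further.
int resident_blocks_per_sm(const SmLimits& lim, int threads_per_block) {
    if (threads_per_block <= 0) return 0;
    int by_threads = lim.max_threads_per_sm / threads_per_block;
    return by_threads < lim.max_blocks_per_sm ? by_threads : lim.max_blocks_per_sm;
}

// Upper bound on blocks resident on the whole device at the same time.
int resident_blocks_on_device(const SmLimits& lim, int threads_per_block) {
    return resident_blocks_per_sm(lim, threads_per_block) * lim.sm_count;
}

// Number of rounds ("waves") needed to run the whole grid if every SM is
// kept full; returns 0 when no block fits on an SM at all.
int waves_needed(const SmLimits& lim, int grid_blocks, int threads_per_block) {
    int per_device = resident_blocks_on_device(lim, threads_per_block);
    if (per_device == 0) return 0;
    return (grid_blocks + per_device - 1) / per_device;  // ceiling division
}
```

With the numbers above, `resident_blocks_per_sm(kK20, 1024)` gives 2 (the thread limit is tighter than the 16-block limit), so at most 2 * 13 = 26 of the 96 blocks could be resident at once, and `waves_needed(kK20, 96, 1024)` gives 4. If that is right, neither of my two answers is correct and the kernel alone could fill all 13 SMs, but I would like confirmation.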