According to the GK110 whitepaper, each SMX supports at most 64 resident warps, which at 32 threads per warp gives a maximum of 2048 resident threads.
My question is this: does each SMX always operate at this maximum of 64 resident warps (assuming no thread divergence and a block size that is a multiple of 64)?
I have reason to believe that if the number of threads on an SMX is less than 1024, you will only get a maximum of 32 warps on that multiprocessor.
(I believe this because my similarly clocked Fermi card shows similar speeds to my Kepler card when running the same code with 1024 threads in a single block.)