I need to determine both latency and throughput for (unsigned) modular multiplication in CUDA and on CPU (i5 750).
For the CPU I found this document (pg. 121, Sandy Bridge). I am not really sure which table I should refer to, but for "MUL IMUL r32" I get a latency of 4 cycles and a reciprocal throughput of 2. A "DIV r64" has a latency of 30-94 cycles and a reciprocal throughput of 22-76.
Worst case scenario:
latency 94+4
rec.thr. 76+2
Right? Although I am using OpenSSL to perform them, I am pretty sure that at the lowest level it always runs simple modular multiplications.
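To make clear how I am reading the two numbers, here is a rough sketch of the cost model I have in mind. It assumes the simplest possible picture: a chain of dependent operations pays the full latency each time, while independent operations can overlap and are limited by the reciprocal throughput. The function names are made up for illustration, and simply summing the MUL and DIV figures is my own pessimistic assumption.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case figures quoted above, in cycles, for a MUL followed by a DIV.
// Adding the two reciprocal throughputs is a pessimistic assumption.
const std::uint64_t kLatency = 94 + 4;        // 98 cycles
const std::uint64_t kRecThroughput = 76 + 2;  // 78 cycles

// n modular multiplications where each one needs the previous result:
// nothing overlaps, so every operation pays the full latency.
inline std::uint64_t dependent_cycles(std::uint64_t n) {
    return n * kLatency;
}

// n independent modular multiplications: the first result appears after
// one latency, and each further one is issued kRecThroughput cycles later.
inline std::uint64_t independent_cycles(std::uint64_t n) {
    assert(n > 0);  // the model only makes sense for at least one operation
    return kLatency + (n - 1) * kRecThroughput;
}
```

Under this model the latency bound matters for a dependent chain (as in modular exponentiation), while the reciprocal throughput bound matters when many unrelated multiplications are in flight.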
Regarding CUDA, I am currently performing modular multiplications in PTX: multiplying two 32-bit numbers, saving the result in a 64-bit register, loading a 32-bit modulus into a 64-bit register, and then doing a 64-bit modulo.
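In case the description is unclear, the sequence is equivalent to the following host-side C++ sketch. This is not my actual PTX, only the same steps written out; the function name `mulmod32` is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the steps described above.
inline std::uint32_t mulmod32(std::uint32_t a, std::uint32_t b, std::uint32_t m) {
    assert(m != 0);  // modulo by zero is undefined

    // 32-bit x 32-bit multiplication, result kept in a 64-bit value.
    std::uint64_t product = static_cast<std::uint64_t>(a) * b;

    // 32-bit modulus widened to 64 bits.
    std::uint64_t modulus = m;

    // 64-bit modulo; the remainder is smaller than m, so it fits in 32 bits.
    return static_cast<std::uint32_t>(product % modulus);
}
```

The widening cast before the multiplication is what keeps the full 64-bit product; without it the multiplication would wrap at 32 bits.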
If you look here, pg 76, they say the throughput on Fermi 2.x for 32-bit integer multiplication is 16 (per clock cycle per MP). Regarding modulo, they just say: "below 20 instructions on devices of compute capability 2.x"...
What does that mean exactly? A worst case of 20 cycles of latency per modulo per MP? And what about throughput? How many modulos per clock cycle per MP?
Edit:
And what if I have a warp where only the first 16 threads have to perform a 32-bit multiplication (16 per cycle per MP)? Will the GPU be busy for one cycle or two, even though the second half of the warp has nothing to do?