I need to determine both the latency and the throughput of (unsigned) modular multiplication in CUDA and on the CPU (an i5 750).

For the CPU I found this document (pg. 121). I am not really sure which table I should refer to, but for "MUL, IMUL r32" on Sandy Bridge I get a latency of 4 cycles and a reciprocal throughput of 2. A "DIV r64" has a latency of 30-94 and a reciprocal throughput of 22-76.

Worst case scenario:

  • latency: 94 + 4 = 98 cycles

  • reciprocal throughput: 76 + 2 = 78 cycles

Right? Although I am using OpenSSL to perform them, I am pretty sure that at the lowest level they always come down to simple modular multiplications.

Regarding CUDA, I am currently performing modular multiplications in PTX: multiplying two 32-bit numbers, saving the result in a 64-bit register, loading the 32-bit modulus into a 64-bit register, and then performing a 64-bit modulo.
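
Roughly, a minimal sketch of that sequence as inline PTX in a CUDA device function could look like this (the wrapper and names are illustrative; it assumes the modulus fits in 32 bits, so the remainder does too):

    // Illustrative sketch: 32b x 32b multiply, widened modulus, 64b modulo.
    __device__ unsigned int modmul32(unsigned int a, unsigned int b,
                                     unsigned int m)
    {
        unsigned int r;
        asm("{\n\t"
            ".reg .u64 t, md;\n\t"
            "mul.wide.u32 t, %1, %2;\n\t"   // two 32b numbers -> 64b register
            "cvt.u64.u32  md, %3;\n\t"      // load 32b modulus into 64b register
            "rem.u64      t, t, md;\n\t"    // 64b modulo
            "cvt.u32.u64  %0, t;\n\t"       // remainder is < m, fits in 32b
            "}"
            : "=r"(r)
            : "r"(a), "r"(b), "r"(m));
        return r;
    }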

If you look here (pg. 76), the throughput on Fermi (compute capability 2.x) for 32-bit integer multiplication is given as 16 (operations per clock cycle per MP). Regarding modulo, they just say: "below 20 instructions on devices of compute capability 2.x"...

What does that mean exactly? Worst case, 20 cycles of latency per modulo? And what about the throughput: how many modulos per clock cycle per MP?

Edit:

And what if only the first 16 threads of a warp have to perform a 32-bit multiplication (at 16 per clock cycle per MP)? Will the GPU be busy for one cycle or two, even though the second half of the warp has nothing to do?

elect

1 Answer

[Since you also asked the same question on the NVIDIA forums, http://devtalk.nvidia.com, I simply copied the answer I gave there to StackOverflow. In general, cross-references are helpful when questions are asked on multiple platforms.]

Latency is fairly meaningless with a throughput architecture like the GPU. The easiest way to determine throughput numbers for whatever operation you are interested in is to measure it on the device you plan to target. As far as I know, this is how the tables are generated for the CPU document you referenced.

You can disassemble the machine code (SASS) for the modulo operation using cuobjdump --dump-sass. When I do this for sm_20, I count a total of sixteen instructions for a 32/32->32 bit unsigned modulo. From the instruction mix, I would estimate the throughput to be around 20 billion operations per second on a Tesla C2050, across the entire GPU (note that this is a guesstimate, not a measured number!).
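
For instance, a trivial kernel along these lines (an illustrative sketch; the names are mine) is enough to make the compiler emit the inlined modulo sequence, which you can then inspect with cuobjdump --dump-sass:

    // Compile, e.g.: nvcc -arch=sm_20 -cubin mod.cu
    // then inspect:   cuobjdump --dump-sass mod.cubin
    __global__ void mod_kernel(const unsigned int *a, const unsigned int *b,
                               unsigned int *r, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            r[i] = a[i] % b[i];   // 32/32->32 bit unsigned modulo
    }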

As for the 64/64->64 bit unsigned modulo, which is implemented as a called subroutine, I recently measured a throughput of 6.4 billion operations per second on a C2050 using CUDA 5.0.
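
If you want to run a similar measurement yourself, a rough harness along the following lines should work (an illustrative sketch, not the framework behind the number above; the grid dimensions, operands, and repetition count are arbitrary, and error checking is omitted for brevity):

    #include <cstdio>

    #define REPS 64   // 64-bit modulo operations per thread

    // Each thread performs REPS 64-bit modulos; accumulating into global
    // memory keeps the results live so the compiler cannot elide the loop.
    __global__ void mod64_bench(unsigned long long *out,
                                unsigned long long seed, unsigned long long m)
    {
        unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned long long v = seed + tid;
        unsigned long long acc = 0;
        for (int i = 0; i < REPS; ++i)
            acc += (v + i) % m;   // invokes the 64/64->64 modulo subroutine
        out[tid] = acc;
    }

    int main()
    {
        const int blocks = 65520, threads = 256;
        unsigned long long *d_out;
        cudaMalloc(&d_out, (size_t)blocks * threads * sizeof(*d_out));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        mod64_bench<<<blocks, threads>>>(d_out, 123456789ULL, 0xfffffffbULL);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double ops = (double)blocks * threads * REPS;
        printf("%.3e modulo operations per second\n", ops / (ms * 1e-3));

        cudaFree(d_out);
        return 0;
    }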

You might want to look into the algorithms of Montgomery and Barrett for modular multiplications, instead of using division.
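
To give a flavor of the Montgomery approach, here is a sketch of a 32-bit Montgomery multiplication (the names and interface are mine; it assumes an odd modulus m < 2^31 so the 64-bit intermediates cannot overflow, a precomputed constant minv = -m^(-1) mod 2^32, and operands already converted to Montgomery form):

    // Montgomery reduction (REDC): computes t * 2^(-32) mod m.
    // Assumptions: m odd, m < 2^31, minv = -m^(-1) mod 2^32.
    __device__ unsigned int mont_redc(unsigned long long t,
                                      unsigned int m, unsigned int minv)
    {
        unsigned int u = (unsigned int)t * minv;              // (t mod 2^32) * minv mod 2^32
        unsigned long long s = t + (unsigned long long)u * m; // low 32 bits become zero
        unsigned int r = (unsigned int)(s >> 32);             // exact division by 2^32
        return (r >= m) ? r - m : r;                          // r < 2m, one subtraction
    }

    // Montgomery product: a and b are in Montgomery form (x * 2^32 mod m).
    __device__ unsigned int mont_mul(unsigned int a, unsigned int b,
                                     unsigned int m, unsigned int minv)
    {
        return mont_redc((unsigned long long)a * b, m, minv);
    }

The point is that the expensive 64-bit rem is replaced by two 32-bit multiplies, a shift, and a conditional subtraction.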

njuffa
  • Hi njuffa, nice to see you again :), thanks for your answer (+1). I wrote that because I need to write down a mathematical model that estimates the GPU speed-up for a specific algorithm (RNS Montgomery exponentiation). So far I would like to use Amdahl's Law (strong scaling), which gives the maximum theoretical speed-up. The problem is comparing the CPU workload with the GPU workload. Suppose we have k modular multiplications; if k = 34, for example, on the CPU I will have 34*(32b multiplication + 64b modulo). On a compute capability 2.0 GPU, how could I estimate this? I thought to do the following: a first warp is fully – elect Nov 07 '12 at 07:04
  • executed, so if our throughput is 16 32-bit integer multiplications per clock cycle per SM, then we need to spend 2 cycles to execute the first 32 multiplications + another cycle for the remaining 2 (we don't take the modulo into account yet). Is this correct? Or is there a better way to estimate it? Moreover, if the throughput is 16 int. mul./clock/MP, is it correct to say that the i-th warp, needing to execute at least 16+1 multiplications, will keep the MP busy for 2 cycles, even if there are other warps ready to be served? – elect Nov 07 '12 at 07:14
  • I am afraid I don't know what "k modular multiplications" refers to. How much scaling you'd see would depend on how much parallelism you can expose. For relatively short operands (say, up to 512 bits), you could do the exponentiation per thread, but would then need on the order of 10K threads to fill the GPU well (e.g. 512 threads per SM, each using 64 registers). As each thread in that scheme handles one exponentiation, it requires as many independent exponentiations as there are threads. – njuffa Nov 07 '12 at 13:56
  • Sorry, I will try to make it clear. Referring to a step of the algorithm, "k" indicates the number of 32-bit modular multiplications I need to perform in parallel. Each modular multiplication simply implies a 32-bit integer multiplication. Since I am launching one thread per modular multiplication, I do have a thread per 32-bit int. mul. Does it look better now? – elect Nov 07 '12 at 14:56
  • The C2050 can execute 515.2e9 simple integer operations per second, and half that throughput is available for integer-multiply-type instructions (IMUL or IMAD), i.e. 257.6e9 integer-multiply-type instructions/sec. A 256x256->256 bit multiply for example requires 2 32-bit IMUL and 62 32-bit IMAD instructions, for a total of 64 integer-multiply-type instructions. Thus the C2050 will be able to execute about four billion 256x256->256 bit multiplies per second. In a quick experiment, I measured 3.94e9 such multiplications per second (without subtracting out the tiny overhead of my test framework). – njuffa Nov 07 '12 at 18:08
  • Ok, I think I got your point of view. Thanks for your info, njuffa :) Do you also have some measurements on commercial Fermi and Kepler cards? – elect Nov 11 '12 at 18:24
  • If by "commercial cards" you mean "consumer cards", no. I simply measured on the C2050 in the workstation I use at work. It is fairly straightforward to set up a simple throughput test to measure the data you are interested in, but I do not have time to measure additional operations. – njuffa Nov 11 '12 at 22:19
  • njuffa, I did some estimation based on your results regarding the 64/64->64 bit unsigned modulo. How can that be possible? I mean: 6.4 billion / 14 SMs = 450 M/SM. The clock rate is 1150 MHz, so 1150/450 = 2.5. This means that the average throughput per modulo is 2.5 cycles... I would expect that to be the result for multiplications, not modulos... Ok, the context switching runs many warps in parallel, hiding latency and so on, but honestly, since modulos are extremely expensive (almost 20 multiply instructions), I can't see how that could be possible. – elect Nov 18 '12 at 11:50
  • I do not have the disassembly in front of me, but from memory, the 64/64->64 unsigned modulo is around 65 instructions or so, almost all of them are integer instructions: some simple, some integer-multiply types. As I noted, a C2050 can execute 515.2e9 simple integer operations / sec, or 257.6e9 integer-multiply-type instr /sec. 6.4e9 modulos/sec * 65 instr / modulo = 416e9 instr /sec, so the stated throughput is plausible. I can double-check modulo throughput and instruction count next time I have access to a C2050. Your estimate seems to omit parallelism inside each SM completely. – njuffa Nov 18 '12 at 18:56
  • 1
    I double-checked the 64-bit unsigned modulo with CUDA 5.0 on a C2050. The disassembly of the subroutine for sm_20 shows 67 instructions, out of which 36 are of the integer-multiply type (IMUL or IMAD). The measured throughput is 6.384e9 modulo operations per second. – njuffa Nov 21 '12 at 02:15
  • Thanks njuffa for the interesting information; one last question: how many blocks and threads do you launch? – elect Nov 21 '12 at 07:05
  • If I recall correctly, I used 65520 blocks of 256 threads each. – njuffa Nov 21 '12 at 11:20
  • These days I am running a little program of my own to test the throughput. Right now I am using a 580. In order to get the maximum utilization from each SM (max threads/SM = 1536 and max threads/block = 1024), I decided to create blocks of 512 threads each. In this way I hope that every SM runs 3 blocks, reaching its maximum capacity (1536 threads/SM). So I launch 16 SMs * 3 blocks/SM = 48 blocks in total. But looking at your last comment, you have something completely different: a huge number of blocks with fewer threads each... why? – elect Mar 01 '13 at 09:29
  • On a C2050, to fill an SM optimally, in addition to the configuration you used, you can use 4 blocks of 384 threads, 6 blocks of 256 threads, or 8 blocks of 192 threads. As a heuristic, if there is a resource limitation that prevents running the maximum number of threads per SM, smaller granularity thread blocks make it easier to achieve high occupancy. Since 192 threads is a block size that does not often fit naturally with an app's thread-to-memory mapping, I often use 256 threads per block as a starting point. Using lots of thread blocks is useful in maximizing memory performance. – njuffa Mar 01 '13 at 18:01
  • Ah, interesting, that makes sense... but can a resource limitation also take place if one uses a dedicated GPU (i.e. a second GPU with no monitor attached)? – elect Mar 02 '13 at 19:55
  • When each thread uses many registers, or each thread block uses a lot of shared memory, it is not possible to run the maximum possible number of threads on an SM. In those cases the occupancy is limited by one or several per-SM resources. – njuffa Mar 02 '13 at 22:29