Modular arithmetic on the GPU

Question

I am working on a GPU algorithm that performs a lot of modular computations, in particular various operations on matrices over a finite field. In the long run these reduce to primitive operations like (a*b - c*d) mod m or (a*b + c) mod m, where a, b, c and d are residues modulo m and m is a 32-bit prime.
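For reference, the two primitives can be written directly with a 64-bit intermediate product. This is only a minimal host-side C++ sketch of the straightforward (slow) formulation, assuming a, b, c and d are already reduced modulo m. The names mul_add_mod and mul_sub_mod are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// (a*b + c) mod m for residues a, b, c < m (hypothetical helper name).
// The 64-bit sum cannot overflow: (m-1)^2 + (m-1) < 2^64 for any 32-bit m.
inline uint32_t mul_add_mod(uint32_t a, uint32_t b, uint32_t c, uint32_t m) {
    assert(m != 0 && a < m && b < m && c < m);
    return static_cast<uint32_t>((static_cast<uint64_t>(a) * b + c) % m);
}

// (a*b - c*d) mod m for residues a, b, c, d < m (hypothetical helper name).
// Each product is reduced separately so the subtraction stays in range.
inline uint32_t mul_sub_mod(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                            uint32_t m) {
    assert(m != 0 && a < m && b < m && c < m && d < m);
    uint32_t x = static_cast<uint32_t>((static_cast<uint64_t>(a) * b) % m);
    uint32_t y = static_cast<uint32_t>((static_cast<uint64_t>(c) * d) % m);
    // If x < y the true difference is negative, so wrap it back into [0, m).
    return x >= y ? x - y : x + (m - y);
}
```

Every call costs one or two 64-bit % operations, which is exactly the part that turns out to be slow on the GPU.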

Through experimentation I learned that the performance of the algorithm is mostly limited by slow modular arithmetic because integer modulo (%) and division operations are not supported on the GPU in hardware.

I would appreciate it if somebody could give me an idea of how to implement efficient modular computations in CUDA.

To see how this is implemented on CUDA, I use the following code snippet:

__global__ void mod_kernel(unsigned *gout, const unsigned *gin) {

    unsigned tid = threadIdx.x;
    unsigned a = gin[tid], b = gin[tid * 2], m = gin[tid * 3];

    typedef unsigned long long u64;

    __syncthreads();
    unsigned r = (unsigned)(((u64)a * (u64)b) % m);
    __syncthreads();
    gout[tid] = r;
}

This code is not supposed to work; I just wanted to see how modular reduction is implemented in CUDA.

When I disassemble this with cuobjdump --dump-sass (thanks njuffa for advice!), I see the following:

/*0098*/     /*0xffffdc0450ee0000*/     BAR.RED.POPC RZ, RZ;
/*00a0*/     /*0x1c315c4350000000*/     IMUL.U32.U32.HI R5, R3, R7;
/*00a8*/     /*0x1c311c0350000000*/     IMUL.U32.U32 R4, R3, R7;
/*00b0*/     /*0xfc01dde428000000*/     MOV R7, RZ;
/*00b8*/     /*0xe001000750000000*/     CAL 0xf8;
/*00c0*/     /*0x00000007d0000000*/     BPT.DRAIN 0x0;
/*00c8*/     /*0xffffdc0450ee0000*/     BAR.RED.POPC RZ, RZ;

Note that between the two BAR.RED.POPC instructions there is a call to the procedure at 0xf8, which implements some sophisticated algorithm (about 50 instructions or even more). It is not surprising that the mod (%) operation is slow.

What platform are you on? A high-end Fermi GPU will likely give you the best performance among shipping cards. Are all operands 32-bit integers? You would want the operands to be of type "unsigned int" for best performance. The compiler generates very efficient code for division and modulo with fixed divisor, you can check the SASS (machine code) with cuobjdump --dump-sass. You may want to use templated functions with constant divisors if you only use a few modulos. For Fermi-based GPUs, division and modulo with variable divisors are performance-competitive with hardware-based solutions. — njuffa, Sep 03 '12 at 20:13
Hey Norbert, copy-and-paste this into the answers section where it belongs :-) — ArchaeaSoftware, Sep 03 '12 at 20:40
@njuffa yes, I have a Fermi-based graphics card and all numbers are 32-bit. The point is that the moduli are not fixed: each block solves the same problem for a different modulus (or sometimes I use the 2nd dimension to process 2-3 moduli per block if the problem size is too small). Actually the whole idea is to use the GPU to solve a problem for different primes and then reconstruct the result on the host side. — , Sep 03 '12 at 20:53
Unsigned 32-bit division and modulo with a variable divisor compile to about 16 inlined machine instructions on Fermi and Kepler GPUs (if memory serves). The code is limited by the throughput of integer multiplies so you should see very good performance on Fermi-class GPUs, and a very decent speedup compared to CPUs. What performance are you observing? — njuffa, Sep 03 '12 at 21:05

score 14 · Accepted Answer · 2012-09-04T12:46:01.450

Some time ago I experimented a lot with modular arithmetic on the GPU. On Fermi GPUs you can use double-precision arithmetic to avoid expensive div and mod operations. For example, modular multiplication can be done as follows:

// fast truncation of double-precision to integers
#define CUMP_D2I_TRUNC (double)(3ll << 51)
// computes r = a + b subop c unsigned using extended precision
#define VADDx(r, a, b, c, subop) \
    asm volatile("vadd.u32.u32.u32." subop " %0, %1, %2, %3;" :  \
            "=r"(r) : "r"(a) , "r"(b), "r"(c));

// computes a * b mod m; invk = (double)(1<<30) / m
__device__ __forceinline__ 
unsigned mul_m(unsigned a, unsigned b, volatile unsigned m,
    volatile double invk) { 

   unsigned hi = __umulhi(a*2, b*2); // 3 flops
   // 2 double instructions
   double rf = __uint2double_rn(hi) * invk + CUMP_D2I_TRUNC;
   unsigned r = (unsigned)__double2loint(rf);
   r = a * b - r * m; // 2 flops

   // can also be replaced by: VADDx(r, r, m, r, "min") // == umin(r, r + m);
   if((int)r < 0) 
      r += m;
   return r;
}

However, this only works for 31-bit moduli (if losing 1 bit is not critical for you), and you need to precompute 'invk' beforehand. This gives the absolute minimum number of instructions I could achieve, i.e.:

SHL.W R2, R4, 0x1;
SHL.W R8, R6, 0x1;
IMUL.U32.U32 R4, R4, R6;
IMUL.U32.U32.HI R8, R2, R8;
I2F.F64.U32 R8, R8;
DFMA R2, R2, R8, R10;
IMAD.U32.U32 R4, -R12, R2, R4;
ISETP.GE.AND P0, pt, R4, RZ, pt;
@!P0 IADD R4, R12, R4;

For a description of the algorithm, you can have a look at my paper: gpu_resultants. Other operations like (xy - zw) mod m are also explained there.
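To experiment with mul_m without a GPU, the same steps can be mirrored on the host. The following is only a sketch under stated assumptions, not the code from the paper. The helpers umulhi_host, double2loint_host, make_invk and mul_m_host are hypothetical names, and the two stand-ins assume that __umulhi returns the upper 32 bits of the 64-bit product and that __double2loint returns the low 32 bits of the double's bit pattern. It is useful for spot checks against %, and does not prove correctness over the whole input range.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host stand-in for __umulhi: upper 32 bits of the 64-bit product.
inline uint32_t umulhi_host(uint32_t x, uint32_t y) {
    return static_cast<uint32_t>((static_cast<uint64_t>(x) * y) >> 32);
}

// Host stand-in for __double2loint: low 32 bits of the bit pattern.
inline uint32_t double2loint_host(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return static_cast<uint32_t>(bits);
}

// invk = 2^30 / m, precomputed once per modulus.
inline double make_invk(uint32_t m) {
    return static_cast<double>(1u << 30) / m;
}

// Mirrors mul_m step by step. Assumes a, b < m < 2^31.
inline uint32_t mul_m_host(uint32_t a, uint32_t b, uint32_t m, double invk) {
    assert(m < (1u << 31) && a < m && b < m);
    const double magic = 6755399441055744.0;  // 3 * 2^51, i.e. CUMP_D2I_TRUNC
    uint32_t hi = umulhi_host(a * 2, b * 2);  // floor(a*b / 2^30)
    // Adding the magic constant leaves the rounded quotient estimate
    // in the low mantissa bits of rf.
    double rf = static_cast<double>(hi) * invk + magic;
    uint32_t q = double2loint_host(rf);
    uint32_t r = a * b - q * m;               // wraps modulo 2^32
    if (static_cast<int32_t>(r) < 0)          // estimate was one too large
        r += m;
    return r;
}
```

A typical check computes invk once with make_invk(m) and then compares mul_m_host(a, b, m, invk) against ((uint64_t)a * b) % m for a range of residues.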

Out of curiosity, I compared the performance of the resultant algorithm using your modular multiplication:

unsigned r = (unsigned)(((u64)a * (u64)b) % m);

against the optimized version with mul_m.

Modular arithmetic with default % operation:

low_deg: 11; high_deg: 2481; bits: 10227
nmods: 330; n_real_pts: 2482; npts: 2495

res time: 5755.357910 ms; mod_inv time: 0.907008 ms; interp time: 856.015015 ms; CRA time: 44.065857 ms
GPU time elapsed: 6659.405273 ms;

Modular arithmetic with mul_m:

low_deg: 11; high_deg: 2481; bits: 10227
nmods: 330; n_real_pts: 2482; npts: 2495

res time: 1100.124756 ms; mod_inv time: 0.192608 ms; interp time: 220.615143 ms; CRA time: 10.376352 ms
GPU time elapsed: 1334.742310 ms;

So on average it is about 5x faster. Note also that you might not see a speed-up if you just evaluate raw arithmetic performance using a kernel with a bunch of mul_mod operations (like the saxpy example). But in real applications with control logic, synchronization barriers etc., the speed-up is very noticeable.

Thanks for the suggestion. I tried your mul_m algorithm and it works really fast. I also get lower register usage in my kernel now :) On the downside, I have to work with 31-bit primes instead of full 32-bit ones, but this is not very critical for me. — , Sep 04 '12 at 18:53

njuffa · Answer 2 · 2012-09-04T15:50:08.803

A high-end Fermi GPU (e.g. a GTX 580) will likely give you the best performance among shipping cards for this. You would want all 32-bit operands to be of type "unsigned int" for best performance, as there is some additional overhead for the handling of signed divisions and modulos.

The compiler generates very efficient code for division and modulo with a fixed divisor. As I recall, it is usually around three to five machine instructions on Fermi and Kepler. You can check the generated SASS (machine code) with cuobjdump --dump-sass. You might be able to use templated functions with constant divisors if you only use a few different divisors.
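The templated-function idea could look roughly like the sketch below. It is plain C++ for illustration and assumes the set of moduli is small and known at compile time. The names mul_mod_fixed and mul_mod_dispatch and the two example moduli are made up, and whether a given compiler actually turns the % into a short multiply-and-shift sequence should be checked in the generated machine code.

```cpp
#include <cassert>
#include <cstdint>

// The modulus M is a compile-time constant, so the compiler is free to
// replace the division behind % with a multiply-and-shift sequence.
template <uint32_t M>
inline uint32_t mul_mod_fixed(uint32_t a, uint32_t b) {
    static_assert(M != 0, "modulus must be non-zero");
    return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) % M);
}

// Runtime dispatch over a small set of known moduli (example values only),
// falling back to the generic variable-divisor path otherwise.
inline uint32_t mul_mod_dispatch(uint32_t a, uint32_t b, uint32_t m) {
    switch (m) {
        case 2147483629u: return mul_mod_fixed<2147483629u>(a, b);
        case 2147483647u: return mul_mod_fixed<2147483647u>(a, b);
        default:
            assert(m != 0);
            return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) % m);
    }
}
```

This only pays off when a handful of divisors covers the workload. With a different prime per block, as described in the question's comments, the switch would not scale.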

You should see on the order of sixteen inlined SASS instructions being generated for the unsigned 32-bit operations with variable divisor, across Fermi and Kepler. The code is limited by the throughput of integer multiplies and for Fermi-class GPUs is competitive with hardware solutions. Somewhat reduced performance is seen on currently shipping Kepler-class GPUs due to their reduced integer multiply throughput.

[Added later, after clarification of the question:]

Unsigned 64-bit division and modulo with a variable divisor, on the other hand, are called subroutines of about 65 instructions on Fermi and Kepler. They look close to optimal. On Fermi, this is still reasonably competitive with hardware implementations (note that 64-bit integer divisions are not exactly super fast on CPUs that provide this as a built-in instruction). Below is some code that I posted to the NVIDIA forums some time back for the kind of task described in the clarification. It avoids the expensive division, but does assume that fairly large batches of operands share the same divisor. It uses double-precision arithmetic, which is especially fast on Tesla-class GPUs (as opposed to consumer cards). I only did a cursory test of the code, so you might want to test it more carefully before deploying it.

// Let b, p, and A[i] be integers < 2^51
// Let N be an integer on the order of 10000
// for i from 1 to N
// A[i] <-- A[i] * b mod p

/*---- kernel arguments ----*/
unsigned long long *A;
double b, p; /* convert from unsigned long long to double before passing to kernel */
double oop;  /* pass precomputed 1.0/p to kernel */

/*---- code inside kernel -----*/
double a, q, h, l, rem;
const double int_cvt_magic = 6755399441055744.0; /* 2^52+2^51 */

a = (double)A[i];

/* approximate quotient and round it to the nearest integer */
q = __fma_rn (a * b, oop, int_cvt_magic);
q = q - int_cvt_magic;

/* back-multiply, representing p*q as a double-double h:l exactly */
h = p * q;
l = __fma_rn (p, q, -h);

/* remainder is double-width product a*b minus double-double h:l */
rem = __fma_rn (a, b, -h);
rem = rem - l;

/* remainder may be negative as quotient rounded; fix if necessary */
if (rem < 0.0) rem += p;

A[i] = (unsigned long long)rem;
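For testing this more carefully on the CPU, the kernel fragment can be wrapped into a small host function with std::fma standing in for __fma_rn. This is a sketch under the same assumption as above (all operands below 2^51). The function name mul_mod_fma is hypothetical, and the code relies on std::fma being computed with a single rounding, as the C++ standard library specifies.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// a * b mod p via double-precision FMA; assumes a, b, p < 2^51 and p > 0.
inline uint64_t mul_mod_fma(uint64_t a_in, uint64_t b_in, uint64_t p_in) {
    const uint64_t limit = 1ull << 51;
    assert(p_in != 0 && a_in < limit && b_in < limit && p_in < limit);

    const double int_cvt_magic = 6755399441055744.0;  // 2^52 + 2^51
    const double a = static_cast<double>(a_in);
    const double b = static_cast<double>(b_in);
    const double p = static_cast<double>(p_in);
    const double oop = 1.0 / p;  // precomputed per modulus in the kernel

    // Approximate quotient, rounded to the nearest integer.
    double q = std::fma(a * b, oop, int_cvt_magic) - int_cvt_magic;

    // Back-multiply: p*q as an exact double-double h:l.
    double h = p * q;
    double l = std::fma(p, q, -h);

    // Remainder is the exact product a*b minus h:l.
    double rem = std::fma(a, b, -h) - l;

    // The quotient was rounded, so the remainder may be negative.
    if (rem < 0.0) rem += p;
    return static_cast<uint64_t>(rem);
}
```

For operands whose product still fits in 64 bits, the result can be compared directly against (a * b) % p computed in unsigned long long. Larger operands need a wider reference type.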

Thanks for the suggestions: I found the cuobjdump disassembler to be a really handy tool. Unfortunately, in my case I need to perform modular reduction of the product of two 32-bit integers, which finally boils down to computing a 64-bit product modulo m. In the disassembler it really looks awful, that is, there is a call to a procedure with more than 50 instructions. I edited my question to add a small example with my investigations. — , Sep 04 '12 at 09:05
Thanks for the clarification. Unsigned 64-bit division and modulo are called subroutines with about 65 instructions altogether, which is near optimal. On Fermi, that is almost competitive with hardware implementations (64-bit divisions aren't particularly fast on CPUs either that have this as a built-in instruction). — njuffa, Sep 04 '12 at 15:26
this is a good solution, thanks. Now I have an idea how this works. Though in my case I use 32-bit primes because 51-bit primes will occupy too much register space. — , Sep 04 '12 at 18:47

score 1 · Answer 3 · edited May 23 '17 at 11:52

There are tricks to perform mod operations efficiently, but only if m is a power of 2.

For instance, x mod y == x & (y-1), where y is 2^n. A single bitwise operation is the fastest option.
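As a small illustration of the identity, here is a sketch in plain C++. The helper names is_pow2 and mod_pow2 are made up, and the shortcut is only valid when y is an exact power of two.

```cpp
#include <cassert>
#include <cstdint>

// True if y is a power of two (exactly one bit set).
inline bool is_pow2(uint32_t y) {
    return y != 0 && (y & (y - 1)) == 0;
}

// x mod y for y = 2^n only: y - 1 is a mask of the low n bits of x.
inline uint32_t mod_pow2(uint32_t x, uint32_t y) {
    assert(is_pow2(y));
    return x & (y - 1);
}
```

For example, 77 mod 16 keeps the low four bits of 77 (binary 1001101), which is 13.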

Otherwise, probably a look-up table? Below is a link to a discussion of efficient modulo implementations. You might need to implement it yourself to get the most out of it.

Efficient computation of mod

answered Sep 04 '12 at 03:00 by Hong Zhou

yes this is clear that we can use bitwise arithmetic for power-of-two moduli but in my case I have arbitrary 32-bit primes – Sep 04 '12 at 09:12
