3

I'm writing some C code for a research project in number theory, which requires doing a lot of operations in modular arithmetic, with many different moduli. To put it simply: I need to compute (a * b) % n many, many times.

The code is meant to run on a PC with 64-bit words, and all the moduli are known to be less than 2^64, so all the operands are represented as unsigned 64-bit integers.

My question is: would using Montgomery modular multiplication (which uses only additions and multiplications) instead of the C modulo operator % (which is equivalent to a % n = a - n*(a / n) and therefore also needs a division) result in faster execution?

Intuitively, I would say that the answer is no, because (word-size) divisions on a PC are not that much more expensive than (word-size) multiplications, and the Montgomery reduction would actually introduce overhead.

Thanks for any suggestions.

Update: On the one hand, according to Paul Ogilvie (see his comment below), (a * b) % n requires 1 multiplication and 1 division. On the other hand, Montgomery multiplication requires 3 multiplications (ignoring the binary shifts, and ignoring the operations needed to convert the operands to their Montgomery representations and back, since those are done only once per modulus n). So it would seem that Montgomery beats "%" as soon as one division costs more than two multiplications...
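
For reference, here is roughly the Montgomery multiplication I have in mind, just a sketch (it assumes GCC/Clang's __uint128_t extension, an odd modulus n, operands a and b already converted to Montgomery form and reduced mod n, and a precomputed constant nprime = -n^{-1} mod 2^64):

#include <stdint.h>

/* Sketch of Montgomery multiplication with R = 2^64.
   Assumes n is odd, a and b are in Montgomery form and reduced mod n,
   and nprime = -n^{-1} mod 2^64 was precomputed once for this modulus. */
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t n, uint64_t nprime)
{
    __uint128_t t  = (__uint128_t)a * b;             /* multiplication #1 */
    uint64_t    m  = (uint64_t)t * nprime;           /* multiplication #2 (low 64 bits only) */
    __uint128_t mn = (__uint128_t)m * n;             /* multiplication #3 */
    /* t + mn is divisible by 2^64: the low halves cancel to 0 mod 2^64,
       producing a carry exactly when the low half of t is nonzero. */
    __uint128_t s = (__uint128_t)(uint64_t)(t >> 64) + (uint64_t)(mn >> 64) + ((uint64_t)t != 0);
    uint64_t r = (uint64_t)s;
    if ((uint64_t)(s >> 64) != 0 || r >= n)          /* the sum is < 2n, so one subtraction is enough */
        r -= n;
    return r;                                        /* a * b * 2^-64 mod n, still in Montgomery form */
}

If I count correctly, those are the 3 multiplications, plus only shifts, an addition, and a conditional subtraction.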

Norian
  • 31
  • 2
  • The modulo operation is implemented on most CPUs as an instruction. I expect it to be faster than a sequence of instructions. (Note that at the micro-code level of the CPU it can be a sequence of operations. The only way to be absolutely sure is to look up the CPU timing data.) – Paul Ogilvie Jan 20 '20 at 12:33
  • Actually, modulo is implemented as the DIV instruction with the quotient placed in one register and the remainder in another register. – Paul Ogilvie Jan 20 '20 at 12:41
  • 1
    It is my understanding that Montgomery multiplication is often used because of its constant-time operations (which are important in crypto applications), not because of performance. However, the performance is higher than other constant-time methods as far as I know, therefore you may see Montgomery multiplication mentioned for its "performance". – Morten Jensen Jan 20 '20 at 12:42
  • I could not find instruction timing in the Intel _64-ia-32-architectures-software-developer-manual-325462_ – Paul Ogilvie Jan 20 '20 at 12:46
  • @PaulOgilvie: Instruction timing does not belong in an architecture manual because an architecture manual specifies the instruction set—what functions the instructions perform. How fast they are depends on the implementation, which varies from processor model to processor model. Instruction timing belongs in processor manuals. – Eric Postpischil Jan 20 '20 at 13:01
  • OP : I don't have specific experience, but I have seen Karatsuba multiplication touted as a performance enhancer when working with large integers. As far as I understand it, the point of break-even can be a bit fluid, so you may have to benchmark and measure your way. See this answer on crypto.se : https://crypto.stackexchange.com/a/6469/51068 – Morten Jensen Jan 20 '20 at 14:33

2 Answers

3

Your intuition is incorrect. Division is many times slower than multiplication, whether for integers or floating-point numbers. See this excellent answer about a similar question. The exact difference in speed depends on which CPU you are running on, whether the code can be vectorized, and even on what the rest of the code is doing at the same time.

If you do an integer division by a constant, for example if you know n at compile time, then the compiler can transform it into a sequence of multiplications and shifts, perhaps even doing exactly the same thing as Montgomery modular multiplication. If n is not known at compile time, then it is probably worthwhile to implement Montgomery modular multiplication.
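
For illustration (the constant 1000000007 is just an arbitrary example modulus, and the function names are made up): with something like the two functions below, GCC and Clang typically compile the constant-modulus version into a multiply by a precomputed reciprocal plus shifts, with no division instruction, while the variable-modulus version has to perform a real division.

#include <stdint.h>

/* Modulus known at compile time: the compiler can replace the division
   by a multiplication with a "magic" reciprocal plus shifts. */
static uint64_t mod_const(uint64_t a)
{
    return a % 1000000007u;   /* arbitrary example modulus */
}

/* Modulus known only at run time: the compiler has to emit an actual divide. */
static uint64_t mod_var(uint64_t a, uint64_t n)
{
    return a % n;
}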

However, the best answer you will get is by implementing both versions of your code and benchmarking them.

G. Sliepen
  • 7,637
  • 1
  • 15
  • 31
  • The fact that division is many times slower than multiplication does not prove a division implementation is necessarily slower than a Montgomery implementation. Can you demonstrate that a good Montgomery implementation on current Intel or AMD processor models is necessarily slower than a good division implementation? – Eric Postpischil Jan 20 '20 at 13:04
  • @EricPostpischil But that isn't the question. The question is whether a hand-written Montgomery modular multiplication is faster than naive modular multiplication using the modulo operator. And I'm not claiming it will be faster, just that it might be. – G. Sliepen Jan 20 '20 at 13:45
  • This answer does not present a view that is balanced and does not show data to support it. – Eric Postpischil Jan 20 '20 at 14:20
0

all the moduli are known to be less than 2^64, so all the operands are represented as unsigned 64-bit integers.

However, a * b is a 128-bit product, which complicates the story. div takes a 128-bit dividend, and as long as a and b are already reduced mod n we have (a * b) / n < n, so the division cannot overflow (an overflow would imply out-of-range inputs). That makes it trivial to write in x64 assembly:

; compute (A * B) % N
; A: rax
; B: rdx
; N: rcx
; result: rdx
mul rdx     ; rdx:rax = A * B (full 128-bit product)
div rcx     ; divide rdx:rax by N: quotient -> rax, remainder -> rdx

And in plain C the above cannot be written directly, except with some special things such as __uint128_t or _mul128 and _div128 (a one-line __uint128_t version is sketched at the end of this answer). However you get that code to appear, this form of div is the slowest possible form; look for ":DIV r64 128/64b (full)" in, for example, a Haswell instruction timing dump. At nearly a hundred cycles, on pre-Ice Lake CPUs this is basically worse than anything else you can do, short of implementing bit-by-bit division yourself. Ice Lake is different and finally has decent integer division: at 18 cycles (add about 4 for the initial mul, for the overall modmul) it is still not fast, but at least it is not an order of magnitude off the mark, and perhaps worth considering (including buying the new hardware) because:

with many different moduli

This can break everything, depending on how many "many" is. Montgomery multiplication requires finding a funny modular inverse of the modulus, but modulo a power of two, so you can at least use Hensel lifting (a Newton-style iteration) instead of the extended Euclidean algorithm; it still costs about a dozen multiplications plus some extra work, all of it serial. Similarly, Barrett reduction requires finding a funny fixed-point reciprocal approximation, which is a fancy way of saying it requires a division up front. With too many different moduli (so that each one is reused only a few times), that per-modulus setup is never amortized and the tricks are useless.
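
For what it's worth, the Hensel-lifting step is short to write down. A sketch (it assumes n is odd, which Montgomery needs anyway; negate the result if your reduction wants -n^{-1} mod 2^64 instead):

#include <stdint.h>

/* Sketch: n^{-1} mod 2^64 for odd n, by Newton/Hensel iteration.
   Every step doubles the number of correct low bits, so five steps cover 64 bits;
   those are the roughly dozen serial multiplications mentioned above. */
static uint64_t inv64(uint64_t n)
{
    uint64_t x = n;        /* n * n == 1 (mod 8), so 3 low bits are already correct */
    x *= 2 - n * x;        /*  6 bits */
    x *= 2 - n * x;        /* 12 bits */
    x *= 2 - n * x;        /* 24 bits */
    x *= 2 - n * x;        /* 48 bits */
    x *= 2 - n * x;        /* 96 >= 64 bits */
    return x;              /* n * x == 1 (mod 2^64) */
}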
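
And the plain C fallback mentioned near the top, for completeness (a sketch assuming GCC/Clang's __uint128_t; depending on the compiler it becomes the mul + div pair shown above or a call into a 128-bit division helper, but either way it pays for a full 128/64-bit division):

#include <stdint.h>

/* Straightforward (a * b) % n with a full 128-bit intermediate product.
   With a, b < n the quotient fits in 64 bits, which is what a single hardware div needs. */
static uint64_t mod_mul(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t)(((__uint128_t)a * b) % n);
}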

harold
  • 61,398
  • 6
  • 86
  • 164