
I'm trying to implement finite field arithmetic to use it in elliptic curve calculations. Since all that's ever used are arithmetic operations that commute with the modulo operator, I don't see a reason not to delay that operation till the very end. One thing that may happen is that the numbers involved might become (way) too big and impractical/inefficient to work with, but I was wondering if there was a way to determine the optimal conditions/frequency that should trigger a modulo operation in the calculations.

I'm coding in C.

popeye
    This might be better on https://crypto.stackexchange.com/ – Alex Jul 21 '22 at 22:24
  • Maybe it is. I posted here because it pertained to coding performance specifically, but I'll post there as well. Thanks! – popeye Jul 21 '22 at 22:25
  • Yeah that's why I said "might" be better. I'm not sure. Your question is definitely coding related but the repercussions of what you are changing in regard to the algorithm might be better suited for crypto. – Alex Jul 21 '22 at 22:26
  • I also would suspect that the question you asked here is going to be dependent on the target environment in which this will run. What is "optimal" in one architecture may not be portable in a general sense. – Alex Jul 21 '22 at 22:29
  • I would suspect that different costs for mul and add operations would definitely change a few things but I'm mostly looking for a ballpark. Like definitely don't do it this often but also not that spaced apart. Or ideally something as a function of the relative cost of the operations. – popeye Jul 21 '22 at 22:32
  • One of the reasons modular arithmetic appears so often in cryptography is that it yields a finite field, but it's also the only practical way to work with large integers on a computer, if you can work with residue algebra. If you ever find yourself multiplying two 256 bit numbers and expecting any subsequent multiplication to be efficient, you're in for a bad time: subsequent operations are exponentially slower. It's easy to benchmark and see. – OregonTrail Jul 23 '22 at 10:05

3 Answers


To avoid the complexity of elliptic curve crypto (as I'm unfamiliar with its algorithms), let's assume you're doing temp = (a * b) % M; result = (temp * c) % M, and you're thinking about just doing result = (a * b * c) % M instead.

Let's also assume that you're doing this a lot with the same modulus M, so you've precomputed "multiples of M" lookup tables; your modulo code can then use the table to find the highest multiple of "M shifted left by N" that is not greater than the dividend, subtract it from the dividend, and repeat that with decreasing values of N until you're left with the remainder.

If your lookup table has 256 entries, the dividend is 4096 bits, and the divisor is 2048 bits, then you'd reduce the size of the dividend by 8 bits per iteration, so the dividend would become smaller than the divisor (leaving the remainder) after no more than 256 "search and subtract" operations.
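
For illustration, here's a minimal sketch of that table-driven reduction, scaled down to a 64-bit dividend and a non-zero 32-bit modulus so it fits in plain C types (the function and variable names are mine, not anything standard):

#include <stdint.h>

/* Table-driven "search and subtract" reduction, scaled down: the table
   holds the 256 multiples 0*m .. 255*m, and each iteration consumes
   8 bits of the dividend, matching the 8-bits-per-iteration figure. */
uint64_t mod_by_table(uint64_t dividend, uint32_t m)
{
    uint64_t table[256];
    for (int k = 0; k < 256; k++)
        table[k] = (uint64_t)k * m;

    uint64_t rem = 0;
    for (int shift = 56; shift >= 0; shift -= 8) {
        /* rem < m, so chunk = rem*256 + next byte is below 256*m,
           exactly the range the table covers */
        uint64_t chunk = (rem << 8) | ((dividend >> shift) & 0xFF);

        /* "search": binary-search the largest multiple <= chunk */
        int lo = 0, hi = 255;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (table[mid] <= chunk) lo = mid; else hi = mid - 1;
        }
        rem = chunk - table[lo];    /* "subtract" */
    }
    return rem;    /* dividend % m */
}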

For multiplication, it's almost purely "multiply and add digits" for each pair of digits. E.g. using uint64_t as a digit, multiplying 2048-bit numbers means multiplying 32-digit numbers, which involves 32 * 32 = 1024 of those "multiply and add digits" operations.
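
As a minimal sketch, the schoolbook version of that multiply looks like the following (LIMBS = 32 gives 2048-bit operands; __uint128_t is a GCC/Clang extension used to capture each 128-bit digit product):

#include <stdint.h>

#define LIMBS 32    /* 32 x 64-bit digits = 2048 bits */

/* Schoolbook multiply: out = a * b, where out has 2*LIMBS digits.
   The inner statement is one "multiply and add digits" operation,
   executed LIMBS * LIMBS = 1024 times. */
void bigmul(const uint64_t a[LIMBS], const uint64_t b[LIMBS],
            uint64_t out[2 * LIMBS])
{
    for (int i = 0; i < 2 * LIMBS; i++)
        out[i] = 0;

    for (int i = 0; i < LIMBS; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < LIMBS; j++) {
            /* digit product plus running column value plus carry;
               the sum always fits in 128 bits */
            __uint128_t t = (__uint128_t)a[i] * b[j] + out[i + j] + carry;
            out[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        out[i + LIMBS] = carry;
    }
}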

Now we can make comparisons. Specifically, assuming a, b, c, M are 2048-bit numbers:

a) the original temp = (a * b) % M; result = (temp * c) % M would be 1024 "multiply and add", then 256 "search and subtract", then 1024 "multiply and add", then 256 "search and subtract". For totals it'd be 2048 "multiply and add" and 512 "search and subtract".

b) the proposed result = (a * b * c) % M would be 1024 "multiply and add", then 2048 more "multiply and add" (as the result of a*b will be a "twice as big" 4096-bit number), then 512 "search and subtract" (as a*b*c will be twice as big as a*b). For totals it'd be 3072 "multiply and add" and 512 "search and subtract".

In other words (granting lots of assumptions), the proposed result = (a * b * c) % M would be worse, with 50% more "multiply and add" and exactly the same number of "search and subtract" operations.
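
The same trade-off is visible at machine-word scale, where "too big" means "doesn't fit in any C integer type". A minimal sketch of option a), assuming GCC/Clang's __uint128_t:

#include <stdint.h>

/* Option a) with word-sized values: reducing after each multiply keeps
   every intermediate within 128 bits. Option b), (a * b * c) % M, would
   need a 192-bit intermediate that plain C types can't hold. */
uint64_t mulmod3(uint64_t a, uint64_t b, uint64_t c, uint64_t M)
{
    uint64_t temp = (uint64_t)(((__uint128_t)a * b) % M);
    return (uint64_t)(((__uint128_t)temp * c) % M);
}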

Of course none of this (the operations you need for elliptic curve crypto, the sizes of your variables, etc) can be assumed to apply for your specific case.

I was wondering if there was a way to determine the optimal conditions/frequency which should trigger a modulo operation in the calculations.

Yes; the way to determine the optimal conditions/frequency is to do something similar to what I did above: determine the true costs (in terms of lower-level operations, like my "search and subtract" and "multiply and add") and compare them.

In general (regardless of how modulo is implemented, etc.) I'd expect you'll find that doing modulo as often as possible is the fastest option (as it reduces the cost of the multiplications and also the cost of the later/final modulo) for all cases that don't involve addition or subtraction and that don't fit in simple integers.

Brendan
  • Elliptic curves typically use much smaller moduli, 256 bits is common and 521 is the high-end. – President James K. Polk Jul 22 '22 at 20:05
  • This was a cool answer, thanks. However, things start getting more complex: when doing (a*b + c*d) it doesn't make much sense to take the modulo three times. I guess I should try to benchmark the specific algorithm and look for triggers such as (a*b*c) – popeye Jul 23 '22 at 16:00
  • @popeye: Yeah, for cases that involve addition/subtraction (e.g. `(x % M + y % M) % M` vs. `(x + y) % M`), the addition typically increases the number of bits you're dealing with by 1 (e.g. 256 bits + 256 bits = 257 bits worst case) so the extra modulos don't save you much/anything. – Brendan Jul 23 '22 at 17:20

If M is a constant, then an alternative to modulo is to multiply by the "logical inverse" of M. Following up on Polk's comment about 256 bits being a common case, and assuming M is a polynomial of degree 256 with 1-bit coefficients, define the inverse of M to be x^512 / M, which results in a 256-bit "inverse". Call this inverse I. Then for a multiply modulo M:

C = A * B                            ; 512 bit product
Q = (upper 256 bits of C * I)>>256   ; Q = C / M = 256 bit quotient
P = M * Q                            ; 512 bit product
R = lower 256 bits of (C xor P)      ; (A * B)% M   

So this requires 3 extended precision multiplies and one XOR.

If the processor for this code has a carryless multiply, such as x86's PCLMULQDQ, which multiplies two 64-bit operands to produce a 128-bit result, then that could be used as the basis for an extended precision multiply. A basic implementation would need 16 multiplies for a 256-bit by 256-bit multiply producing a 512-bit product. This could be improved using something like Karatsuba:

https://en.wikipedia.org/wiki/Karatsuba_algorithm

but on current x86, PCLMULQDQ is fast, taking 1 to 3 cycles, so the main issue would be loading the data into the XMM registers, and I'm not sure Karatsuba would save much time.
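
For reference, a minimal sketch of one 64 x 64 carryless multiply step built on that instruction (the wrapper name is mine; compile with -mpclmul -msse4.1 on GCC/Clang):

#include <stdint.h>
#include <wmmintrin.h>    /* _mm_clmulepi64_si128 (PCLMULQDQ) */
#include <smmintrin.h>    /* _mm_extract_epi64 (SSE4.1) */

/* One building block of the extended precision multiply: a 64 x 64 bit
   carryless multiply producing a 128-bit polynomial product. */
static inline void clmul64(uint64_t a, uint64_t b,
                           uint64_t *lo, uint64_t *hi)
{
    __m128i x = _mm_set_epi64x(0, (long long)a);
    __m128i y = _mm_set_epi64x(0, (long long)b);
    __m128i p = _mm_clmulepi64_si128(x, y, 0x00); /* low qword of each */
    *lo = (uint64_t)_mm_cvtsi128_si64(p);
    *hi = (uint64_t)_mm_extract_epi64(p, 1);
}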

rcgldr

optimal conditions/frequency which should trigger a modulo operation in the calculations

Standard practice is to replace all actual modulo operations with something else. So the frequency is never. There are different ways to accomplish that:

  • Choose the modulus to be a Mersenne prime or pseudo-Mersenne prime. There is a large repertoire of mathematical tricks to implement arithmetic modulo a (pseudo-)Mersenne prime efficiently, without doing any actual modulo operations (see the sketch after this list). In the context of elliptic curves, the prime-modulus NIST curves are chosen this way and for this reason.
  • Use Barrett reduction. This has the same effect as a real modulo operation, but relies on some precomputation and a precondition on the range of the input to reduce the cost of a modulo-like operation to the cost of a couple of multiplications (plus some supporting operations). Also applicable to polynomial fields.
  • Do arithmetic in Montgomery form.
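
As an illustration of the first bullet, here is a minimal sketch of arithmetic modulo the Mersenne prime p = 2^61 - 1: the reduction is just a shift, a mask, an add, and one conditional subtraction (the helper names are mine, and __uint128_t is a GCC/Clang extension):

#include <stdint.h>

#define P61 ((1ULL << 61) - 1)    /* the Mersenne prime 2^61 - 1 */

/* Reduce any 64-bit x modulo 2^61 - 1. Since 2^61 is congruent to 1
   mod p, the high bits can be folded back in; one conditional
   subtraction then finishes. No division instruction anywhere. */
static inline uint64_t mod_p61(uint64_t x)
{
    x = (x & P61) + (x >> 61);
    return x >= P61 ? x - P61 : x;
}

/* Multiply mod 2^61 - 1, for already-reduced a, b < 2^61. */
static inline uint64_t mulmod_p61(uint64_t a, uint64_t b)
{
    __uint128_t t = (__uint128_t)a * b;    /* t < 2^122 */
    return mod_p61((uint64_t)(t & P61) + (uint64_t)(t >> 61));
}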

Additionally, and perhaps more in the spirit of your question, a common technique is to do various additions without reducing every time (addition does not significantly change the size of a number). It takes a lot of additions before you need an extra limb in your integers, so a lot of them can be done before it starts to make sense to reduce. For multiplications, unless it's by a small constant it almost always makes sense to reduce immediately afterwards to prevent the numbers from getting much physically larger than they need to be (which would be especially bad if the result was fed into another multiplication).
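
Continuing the Mersenne sketch above, a dot product mod p can reduce each product immediately but let the additions pile up; eight terms below 2^61 still fit in a 64-bit accumulator, so a fold is only needed every seven additions:

/* Lazy addition with the 61-bit helpers above: products are reduced
   right away, the running sum only every 7 additions (8 terms below
   2^61 are the most a uint64_t can hold without overflow). */
uint64_t dot_mod_p61(const uint64_t *a, const uint64_t *b, int n)
{
    uint64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += mulmod_p61(a[i], b[i]);    /* each term < 2^61 */
        if (i % 7 == 6)
            acc = mod_p61(acc);           /* keep acc in range */
    }
    return mod_p61(acc);
}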

Another technique, especially associated with Barrett reduction, is to work most of the time in a slightly larger range than [0 .. N), e.g. [0 .. 2N). This enables skipping the conditional subtraction that Barrett reduction needs in order to fully reduce to the range [0 .. N), while still using the most important part: the reduction from the range [0 .. N²) to the range [0 .. 2N).
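
A word-sized Barrett sketch shows both pieces: the multiply-based quotient estimate that lands the result in [0 .. 2N), and the final conditional subtraction that the wider-range representation lets you skip (the names are mine; assumes __uint128_t and a modulus with 1 < n < 2^32):

#include <stdint.h>

typedef struct {
    uint64_t n;    /* modulus, 1 < n < 2^32 */
    uint64_t m;    /* precomputed floor(2^64 / n) */
} barrett_t;

barrett_t barrett_init(uint64_t n)
{
    barrett_t b = { n, (uint64_t)(((__uint128_t)1 << 64) / n) };
    return b;
}

/* Reduce x (e.g. a product of two reduced values, x < n*n). The
   quotient estimate q is off by at most one, so r lands in [0 .. 2n);
   the last line is the conditional subtraction discussed above. */
uint64_t barrett_reduce(barrett_t b, uint64_t x)
{
    uint64_t q = (uint64_t)(((__uint128_t)x * b.m) >> 64);
    uint64_t r = x - q * b.n;             /* r in [0 .. 2n) */
    return r >= b.n ? r - b.n : r;
}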

harold