128x128-bit multiplication on x86 CPU

Question

In my application I need a fast 128x128-bit multiplication (result = 256 bit). Is there any x86 optimised library out there performing this operation?

Yeah, but it does not support a 256-bit integer for the result. — Martin T., Apr 15 '15 at 18:32
related: [Multiplying two 128-bit ints](http://stackoverflow.com/q/22085516/995714). You can also multiply 2 `int128_t`s to get the low 128 bits and then calculate the high bits manually — phuclv, Jun 24 '15 at 07:59
[C++ 128/256-bit fixed size integer types](http://stackoverflow.com/q/5242819/995714) — phuclv, Jun 24 '15 at 08:04
With (Intel) CPUs supporting MULX, ADCX and ADOX instructions, [using separate carry chains](http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html) should beat Toom-Cook(-2) up to a couple of words/limbs. — greybeard, Jan 29 '16 at 11:07

score 2 · Answer 1 · answered Apr 16 '15 at 03:38

There is the GNU GMP library (https://gmplib.org/), which should have well-optimized multiplication for long integers. It comes with a benchmark, https://gmplib.org/download/misc/gmpbench-0.2.tar.bz2, which can be used to test the 128x128 case (multiply.c, args 128 128).

For fixed sizes you can try GMP's low-level interface, mpn: https://gmplib.org/manual/Low_002dlevel-Functions.html

Function: `mp_limb_t mpn_mul (mp_limb_t *rp, const mp_limb_t *s1p, mp_size_t s1n, const mp_limb_t *s2p, mp_size_t s2n)`

Multiply {s1p, s1n} and {s2p, s2n}, and write the (s1n+s2n)-limb result to rp. Return the most significant limb of the result.

The destination has to have space for s1n + s2n limbs, even if the product's most significant limb is zero. No overlap is permitted between the destination and either source.

This function requires that s1n is greater than or equal to s2n.

For some special cases on Haswell, the claimed speed is 1.57-1.8 cycles/limb ("Normally a limb contains 32 or 64 bits"): http://code.metager.de/source/xref/gnu/gmp/mpn/x86_64/coreihwl/mul_1.asm#35

There is `mpn_mul_n` since both operands have the same size. — Marc Glisse, Apr 16 '15 at 19:25

GJ. · Answer 2 · 2015-04-16T20:45:15.453

2

If you only need a fast 128x128-bit multiplication, then you can do it yourself.

On a 32-bit CPU you need 16 multiplications (32x32-bit each), and on a 64-bit CPU 4 multiplications (64x64-bit each).

The algorithm on a 32-bit CPU (using 32-bit multiplication) is:

Let's say that ABCD and EFGH represent two 128-bit numbers, where each letter represents one 32-bit digit of the 128-bit number.

ABCD * EFGH =  
  ABCD * E * 2^96 // multiplying by 2^96 is a left shift by 96 bits, i.e. a move by three 32-bit digits  
+ ABCD * F * 2^64 
+ ABCD * G * 2^32 
+ ABCD * H

Each row has the form ABCD * n, where n is a 32-bit digit, and expands to:

ABCD * n =  
  A * n * 2^96 // multiplying by 2^96 is a left shift by 96 bits, i.e. a move by three 32-bit digits  
+ B * n * 2^64
+ C * n * 2^32 
+ D * n
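The two expansions above amount to schoolbook multiplication on 32-bit digits. A minimal C sketch follows, assuming the digits are stored least significant first (index 0 holds D and H respectively). The function name `mul128_32` and the array layout are illustrative choices, not part of the original answer.

```c
#include <stdint.h>

/* Hypothetical helper: r = a * b, where a and b are 128-bit numbers held as
   four 32-bit digits each and r receives eight 32-bit digits.
   Digit 0 is the least significant one (D in "ABCD", H in "EFGH"). */
void mul128_32(const uint32_t a[4], const uint32_t b[4], uint32_t r[8])
{
    for (int i = 0; i < 8; i++)
        r[i] = 0;

    /* One pass per digit n of b: adds ABCD * n, shifted by j digits, into r. */
    for (int j = 0; j < 4; j++) {
        uint64_t carry = 0;
        for (int i = 0; i < 4; i++) {
            /* 32x32 -> 64-bit widening multiply plus the digit already in r
               plus the carry. The maximum is
               (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so t cannot overflow. */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;   /* low 32 bits stay in this column */
            carry = t >> 32;          /* high 32 bits move to the next column */
        }
        r[j + 4] = (uint32_t)carry;   /* this digit is still zero before row j */
    }
}
```

The innermost statement runs 4 x 4 = 16 times, which matches the 16 multiplications counted above. The shifts by 2^32, 2^64 and 2^96 never happen explicitly: they are expressed by the index `i + j` of the result digit.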

edited Apr 16 '15 at 20:45

answered Apr 16 '15 at 15:32


Many ISAs provide widening multiplication (32x32 => 64-bits in a pair of registers), e.g. x86, ARM, and MIPS. (Some stripped-down ARM cores only have narrow mul). In C, compilers typically know how to optimize `a * (uint64_t)b` into a widening multiply. (But the problem with C comes when you try to get the compiler to emit an `adc` add-with-carry) – Peter Cordes Dec 27 '20 at 05:29
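For the 64-bit case, the four 64x64-bit multiplications mentioned in the answer can be sketched in C as below. This assumes a GCC- or Clang-compatible compiler on a 64-bit target, where the non-standard `unsigned __int128` type is available and each product is typically compiled to a single widening multiply. The name `mul128_64` and the limb layout (index 0 least significant) are illustrative assumptions.

```c
#include <stdint.h>

/* unsigned __int128 is a GCC/Clang extension, not standard C. */
__extension__ typedef unsigned __int128 u128;

/* Hypothetical helper: r = a * b, where a = a[1]*2^64 + a[0] and
   b = b[1]*2^64 + b[0]; r[0] is the least significant 64-bit limb. */
void mul128_64(const uint64_t a[2], const uint64_t b[2], uint64_t r[4])
{
    /* Four 64x64 -> 128-bit partial products. */
    u128 p00 = (u128)a[0] * b[0];
    u128 p01 = (u128)a[0] * b[1];
    u128 p10 = (u128)a[1] * b[0];
    u128 p11 = (u128)a[1] * b[1];

    r[0] = (uint64_t)p00;

    /* Column 1: high half of p00 plus the low halves of both cross terms.
       The sum is below 3 * 2^64, so it fits easily in 128 bits. */
    u128 mid = (p00 >> 64) + (uint64_t)p01 + (uint64_t)p10;
    r[1] = (uint64_t)mid;

    /* Column 2: carry from column 1, high halves of the cross terms,
       and the low half of p11. */
    u128 hi = (mid >> 64) + (p01 >> 64) + (p10 >> 64) + (uint64_t)p11;
    r[2] = (uint64_t)hi;

    /* Column 3: carry from column 2 plus the high half of p11.
       The full product is below 2^256, so this fits in 64 bits. */
    r[3] = (uint64_t)((hi >> 64) + (p11 >> 64));
}
```

Carries are propagated here through 128-bit additions instead of an explicit add-with-carry, which sidesteps the `adc` problem raised in the comment at the cost of leaving the instruction selection to the compiler.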
