can multiprecision signed multiply be performed with imul instruction?

Question

I am writing a function library to provide all conventional operators and functions for signed-integer types s0128, s0256, s0512, s1024 and floating-point types f0128, f0256, f0512, f1024.

I am writing the s0128, s0256, s0512, s1024 multiply routines now, but am getting erroneous results that confuse me. I assumed I could cascade multiplies with the 64-bit imul rcx instruction (that produces a 128-bit result in rdx:rax) in the same way I could do the same with unsigned operands with the mul rcx instruction... but the answers with imul are wrong.

I suspect there is some trick to make this work, maybe mix imul and mul instructions - or something. Or is there some reason one cannot implement larger multiplies with signed multiply instructions?

So you understand the technique, I'll describe the smallest version, for s0128 operands.

           arg2.1   arg2.0  : two 64-bit parts of s0128 operand
           arg1.1   arg1.0  : two 64-bit parts of s0128 operand
           ---------------
       0  out.edx  out.eax  : output of arg1.0 * arg2.0
 out.edx  out.eax           : output of arg1.0 * arg2.1
 -------------------------
 out.2    out.1    out.0    : sum the above intermediate results
 out.edx  out.eax           : output of arg1.1 * arg2.0
 -------------------------
 out.2    out.1    out.0    : sum the above intermediate results

Each time the code multiplies two 64-bit values, it generates a 128-bit result in edx:eax. Each time the code generates a 128-bit result, it sums that result into an accumulating triple of 64-bit registers with addq, adcq, adcq instructions (where the final adcq instruction only adds zero to assure any carry flags gets propagated).

When I multiply small negative numbers by small positive numbers as a test, the result is negative, but there are one or two non-zero bits at the bottom of the upper 64-bit value in the 128-bit s0128 result. This implies to me that something isn't quite right with propagation in multiprecision signed multiplies.

Of course the cascade is quite a bit more extensive for s0256, s0512, s1024.

What am I missing? Must I convert both operands to unsigned, perform unsigned multiply, then negate the result if one (but not both) of the operands was negative? Or can I compute multiprecision results with the imul signed multiply instruction?

Joe Z · Accepted Answer · 2013-12-25T19:10:42.287

When you build an extended precision signed multiply out of smaller multiplies, you end up with a mixture of signed and unsigned arithmetic.

In particular, if you break a signed value in half, you treat the upper half as signed, and the lower half as unsigned. The same is true for extended precision addition, in fact.

Consider this arbitrary example, where AH and AL represent the high and low halves of A, and BH and BL represent the high and low halves of B. (Note: these aren't meant to represent x86 register halves, just halves of a multiplicand.) The L terms are unsigned and the H terms are signed.

              AH : AL
           x  BH : BL
  -------------------
              AL * BL    unsigned x unsigned => zero extend to full precision
         AH * BL           signed x unsigned => sign extend to full precision
         AL * BH         unsigned x   signed => sign extend to full precision
    AH * BH                signed x   signed

The AL * BL product is unsigned because both AL and BL are unsigned. Therefore, it gets zero extended when you promote it to the full precision of the result.

The AL * BH and AH * BL products mix signed and unsigned values. The resulting product is signed, and that needs to be sign extended when you promote it to the full precision of the result.

The following C code demonstrates a 32×32 multiply implemented in terms of 16×16 multiplies. The same principle applies when building 128×128 multiplies out of 64×64 multiplies.

#include <stdint.h>
#include <stdio.h>

int64_t mul32x32( int32_t x, int32_t y )
{
    int16_t x_hi = 0xFFFF & (x >> 16);
    int16_t y_hi = 0xFFFF & (y >> 16);

    uint16_t x_lo = x & 0xFFFF;
    uint16_t y_lo = y & 0xFFFF;


    uint32_t lo_lo = (uint32_t)x_lo * y_lo;    // unsigned x unsigned
    int32_t  lo_hi = (x_lo * (int32_t)y_hi);   // unsigned x   signed
    int32_t  hi_lo = ((int32_t)x_hi * y_lo);   //   signed x unsigned
    int32_t  hi_hi = ((int32_t)x_hi * y_hi);   //   signed x   signed


    int64_t  prod = lo_lo 
                  + (((int64_t)lo_hi + hi_lo) << 16) 
                  + ((int64_t)hi_hi << 32);

    return prod;
}

int check(int a, int b)
{
    int64_t ref = (int64_t)a * (int64_t)b;
    int64_t tst = mul32x32(a, b);

    if (ref != tst)
    {
        printf("%.8X x %.8X => %.16llX vs %.16llX\n",
                (unsigned int)a,         (unsigned int)b, 
                (unsigned long long)ref, (unsigned long long)tst);
        return 1;
    }

    return 0;
}


int main()
{
    int a = (int)0xABCDEF01;
    int b = (int)0x12345678;
    int c = (int)0x1234EF01;
    int d = (int)0xABCD5678;

    int fail = 0;

    fail += check(a, a);
    fail += check(a, b);
    fail += check(a, c);
    fail += check(a, d);

    fail += check(b, b);
    fail += check(b, c);
    fail += check(b, d);

    fail += check(c, c);
    fail += check(c, d);

    fail += check(d, d);

    printf("%d tests failed\n", fail);
    return 0;
}

This pattern extends even if you break the multiplicands into more than two pieces. That is, only the most-significant piece of a signed number gets treated as signed. All of the other pieces are unsigned. Consider this example, which divides each multiplicand into 3 pieces:

                      A2 : A1 : A0
                   x  B2 : B1 : B0
  ---------------------------------
                           A0 * B0    => unsigned x unsigned   => zero extend
                      A1 * B0         => unsigned x unsigned   => zero extend
                 A2 * B0              =>   signed x unsigned   => sign extend
                      A0 * B1         => unsigned x unsigned   => zero extend
                 A1 * B1              => unsigned x unsigned   => zero extend
            A2 * B1                   =>   signed x unsigned   => sign extend
                 A0 * B2              => unsigned x   signed   => sign extend
            A1 * B2                   => unsigned x   signed   => sign extend
       A2 * B2                        =>   signed x   signed

Because of all the mixed-signedness and sign extension fun, it's often just easier to implement a signed × signed multiply as an unsigned × unsigned multiply, and conditionally negate at the end if the signs if the multiplicands differ. (And, in fact, when you get to the extended precision float, as long as you stay in sign-magnitude form like IEEE-754, you won't have to deal with signed multiply.)

This assembly gem shows how to negate extended precision values efficiently. (The gems page is a little dated, but you may find it interesting / useful.)

Thanks for the detailed reply. A few questions. You talk about multiplies with "sign extend" and "zero extend". What does this mean in practice, in assembly language? If either operand is the ms64 (most significant 64-bit value of the operand), the multiply should be performed with `imul` instead of `mul`? Do I need to sign extend all the way to the top of my s1024 result (which has 16 64-bit components)? That would be a huge burden, especially the way my code implements the multiply (because I finish computing the low portion, save it, then gradually move upward through the result). — honestann, Dec 25 '13 at 22:55
You talked about implementing these signed-integer multiplies with the `mul` unsigned multiply instruction, then negate if exactly one operand was negative (an xor of the operand ms64s is negative). That is how I got around the difficulty... except I replaced any negative operand with the negation of itself before performing the unsigned multiply. Then I negate the result in the same way you say. However, you seem to be saying I don't need to negate negative operands before I perform the multiply. Is that correct? — honestann, Dec 25 '13 at 23:00
@honestann: Sorry I wasn't clear. I did mean that you could implement it as an unsigned multiply, taking the product of the absolute values of the inputs, and then negating if one input was positive and the other was negative. My apologies for being unclear. There are other ways of doing signed multiplies with an unsigned multiply instruction. I recall Hacker's Delight gives another way, but my copy is at work. — Joe Z, Dec 25 '13 at 23:13
I have hackers delight (but not the newest edition), so I'll take a look and see. But if you can answer my first question above, that would be great. I'm still confused by your terminology - how to implement with assembly language (which multiply instructions, and exactly what you mean by zero/sign-extend when the result can be as huge as 1024-bits (quite a long way to sign-extend!)). — honestann, Dec 25 '13 at 23:51

can multiprecision signed multiply be performed with imul instruction?

1 Answers1

Linked