4

Hi I'm implementing some fixed point math stuff for embedded systems and I'm trying to do the multiplication of two 16.16 fixed point numbers without creating a 64bit temporary. So far here is the code I came up with that generates the least instructions.

int multiply(int x, int y){
    int result;
    long long temp = x;
    temp *= y;
    temp >>= 16;
    result = temp;
    return result;
}

the problem with this code is that it uses a temporary 64 bit integer which seem to generate bad assembly code. I'm trying to make a system that uses two 32 bit integers instead of a 64 bit one. Anyone know how to do this?

tletnes
  • 1,958
  • 10
  • 30
PgrAm
  • 671
  • 2
  • 7
  • 19
  • 4
    Did you see a 64-bit temporary in the disassembly dump of the output, or just in your head? Just because the source code has a variable of type `long long` does not mean there's actually any "64-bit temporary". On any decent 32-bit arch, a 32x32 multiply automatically generates a 64-bit result, usually separated into two 32-bit registers for your (or the compiler's) convenience. Manipulating these is much more efficient than breaking it down into 4 16x16 multiplies to avoid the "64-bit temporary". – R.. GitHub STOP HELPING ICE Feb 27 '13 at 23:03
  • @R.. yeah I looked into the assembler dump. it uses two halves in the EAX and EDX register but it's implementation is not optimal – PgrAm Feb 28 '13 at 00:31
  • It's surely a lot *less suboptimal* than performing four MULs... – R.. GitHub STOP HELPING ICE Feb 28 '13 at 00:45
  • 1
    @R.. the idea is that now that I know how to do it in C I can implement it in hand optimized assembly – PgrAm Feb 28 '13 at 00:56
  • Hope you mean doing the 32x32 -> 64 mul in asm -- that could help, but writing the C efficiently should produce near-identical results. Doing the 4 16x16 muls in asm will be horribly slow. – R.. GitHub STOP HELPING ICE Feb 28 '13 at 01:29
  • 1
    most modern x86 compilers will automatically use the upper 32 bits of the result if you specify the optimization level high enough – phuclv Mar 23 '14 at 10:39
  • ISO C has no extended-precision stuff like shift-between-variables, widening-multiply, or add-with-carry. Expressing something in C is useful to make sure you understand the algorithm (especially if the widest part is at most 64-bit so you don't have to manually do extended precision), but it's typically not a useful starting point for an efficient asm implementation. Unless an optimizing compiler does a good job with it. – Peter Cordes Dec 27 '20 at 07:18

1 Answers1

6

Think of your numbers as each composed of two large "digits."

  A.B
x C.D

The "base" of the digits is the 2^bit_width, i.e., 2^16, or 65536.

So, the product is

D*B       + D*A*65536 + C*B*65536 + C*A*65536*65536

However, to get the product shifted right by 16, you need to divide all these terms by 65536, so

D*B/65536 + D*A       + C*B        + C*A*65536

In C:

uint16_t a = x >> 16;
uint16_t b = x & 0xffff;
uint16_t c = y >> 16;
uint16_t d = y & 0xffff;

return ((d * b) >> 16) + (d * a) + (c * b) + ((c * a) << 16);

The signed version is a bit more complicated; it is often easiest to perform the arithmetic on the absolute values of x and y and then fix the sign (unless you overflow, which you can check for rather tediously).

Doug Currie
  • 40,708
  • 1
  • 95
  • 119
  • Perfect exactly what I was looking for. – PgrAm Feb 27 '13 at 22:47
  • 3
    And this is going to be a lot slower than doing it the way you originally tried. – R.. GitHub STOP HELPING ICE Feb 27 '13 at 23:05
  • 1
    Well, the speed depends on the capabilities of the embedded processor (note that I answered the question before the EAX register was mentioned!). On a 16-bit processor, my approach will be slightly faster because a 32x32->64-bit multiply will also have 4 16-bit multiplies, but also must do 64-bit additions, whereas my approach only needs 32-bit additions (and no actual shifts since on a 16-bit processor the 16-bit words of the product will be individual accessible). – Doug Currie Feb 28 '13 at 23:40