fixed point multiplication without 64 bit temporary

Question

Hi I'm implementing some fixed point math stuff for embedded systems and I'm trying to do the multiplication of two 16.16 fixed point numbers without creating a 64bit temporary. So far here is the code I came up with that generates the least instructions.

int multiply(int x, int y){
    int result;
    long long temp = x;
    temp *= y;
    temp >>= 16;
    result = temp;
    return result;
}

the problem with this code is that it uses a temporary 64 bit integer which seem to generate bad assembly code. I'm trying to make a system that uses two 32 bit integers instead of a 64 bit one. Anyone know how to do this?

Did you see a 64-bit temporary in the disassembly dump of the output, or just in your head? Just because the source code has a variable of type `long long` does not mean there's actually any "64-bit temporary". On any decent 32-bit arch, a 32x32 multiply automatically generates a 64-bit result, usually separated into two 32-bit registers for your (or the compiler's) convenience. Manipulating these is much more efficient than breaking it down into 4 16x16 multiplies to avoid the "64-bit temporary". — R.. GitHub STOP HELPING ICE, Feb 27 '13 at 23:03
@R.. yeah I looked into the assembler dump. it uses two halves in the EAX and EDX register but it's implementation is not optimal — PgrAm, Feb 28 '13 at 00:31
It's surely a lot *less suboptimal* than performing four MULs... — R.. GitHub STOP HELPING ICE, Feb 28 '13 at 00:45
@R.. the idea is that now that I know how to do it in C I can implement it in hand optimized assembly — PgrAm, Feb 28 '13 at 00:56
Hope you mean doing the 32x32 -> 64 mul in asm -- that could help, but writing the C efficiently should produce near-identical results. Doing the 4 16x16 muls in asm will be horribly slow. — R.. GitHub STOP HELPING ICE, Feb 28 '13 at 01:29
most modern x86 compilers will automatically use the upper 32 bits of the result if you specify the optimization level high enough — phuclv, Mar 23 '14 at 10:39
ISO C has no extended-precision stuff like shift-between-variables, widening-multiply, or add-with-carry. Expressing something in C is useful to make sure you understand the algorithm (especially if the widest part is at most 64-bit so you don't have to manually do extended precision), but it's typically not a useful starting point for an efficient asm implementation. Unless an optimizing compiler does a good job with it. — Peter Cordes, Dec 27 '20 at 07:18

Doug Currie · Accepted Answer · 2013-02-27T22:41:05.093

6

Think of your numbers as each composed of two large "digits."

  A.B
x C.D

The "base" of the digits is the 2^bit_width, i.e., 2^16, or 65536.

So, the product is

D*B       + D*A*65536 + C*B*65536 + C*A*65536*65536

However, to get the product shifted right by 16, you need to divide all these terms by 65536, so

D*B/65536 + D*A       + C*B        + C*A*65536

In C:

uint16_t a = x >> 16;
uint16_t b = x & 0xffff;
uint16_t c = y >> 16;
uint16_t d = y & 0xffff;

return ((d * b) >> 16) + (d * a) + (c * b) + ((c * a) << 16);

The signed version is a bit more complicated; it is often easiest to perform the arithmetic on the absolute values of x and y and then fix the sign (unless you overflow, which you can check for rather tediously).

edited Feb 27 '13 at 22:41

answered Feb 27 '13 at 22:35

Doug Currie

40,708
1
95
119

Perfect exactly what I was looking for. – PgrAm Feb 27 '13 at 22:47
3

And this is going to be a lot slower than doing it the way you originally tried. – R.. GitHub STOP HELPING ICE Feb 27 '13 at 23:05
1

Well, the speed depends on the capabilities of the embedded processor (note that I answered the question before the EAX register was mentioned!). On a 16-bit processor, my approach will be slightly faster because a 32x32->64-bit multiply will also have 4 16-bit multiplies, but also must do 64-bit additions, whereas my approach only needs 32-bit additions (and no actual shifts since on a 16-bit processor the 16-bit words of the product will be individual accessible). – Doug Currie Feb 28 '13 at 23:40

fixed point multiplication without 64 bit temporary

1 Answers1

Linked