I'm working through a problem from a book called 'Computer Systems'. Here is the problem I'm struggling with.


Question: The following code computes the 128-bit product of two 64-bit signed values x and y and stores the result in memory:

    #include <stdint.h>

    typedef __int128 int128_t;

    void store_prod(int128_t *dest, int64_t x, int64_t y) {
        *dest = x * (int128_t) y;
    }

GCC generates the following assembly code implementing the computation:

1   store_prod:
2     movq    %rdx, %rax
3     cqto
4     movq    %rsi, %rcx
5     sarq    $63, %rcx
6     imulq   %rax, %rcx
7     imulq   %rsi, %rdx
8     addq    %rdx, %rcx
9     mulq    %rsi
10    addq    %rcx, %rdx
11    movq    %rax, (%rdi)
12    movq    %rdx, 8(%rdi)
13    ret

This code uses three multiplications for the multiprecision arithmetic required to implement 128-bit arithmetic on a 64-bit machine. Describe the algorithm used to compute the product, and annotate the assembly code to show how it realizes your algorithm.


I tried to annotate each assembly instruction, but I'm lost from the 4th instruction onward. I understand what each instruction does on its own, but not how they combine to compute the product.

1   store_prod:
2     movq    %rdx, %rax       // copy y to rax
3     cqto                     // sign-extend rax into rdx:rax (rdx = sign word of y)
4     movq    %rsi, %rcx       // copy x to rcx
5     sarq    $63, %rcx        // right arithmetic shift 63 times (why..?)
6     imulq   %rax, %rcx       // multiply rcx by rax (why..?)
7     imulq   %rsi, %rdx       // multiply rdx by x (why..?)
8     addq    %rdx, %rcx       // add rdx to rcx (why..?)
9     mulq    %rsi             // unsigned multiply rax by x [rdx:rax = 128-bit product]
10    addq    %rcx, %rdx       // add rcx to xy (why..?)
11    movq    %rax, (%rdi)     // store rax at dest
12    movq    %rdx, 8(%rdi)    // store rdx at dest+8
13    ret                      //

Sorry for my broken English, I hope you understood what I'm saying.

Peter Cordes
SteveLim
  • mov + sarq is "manually" doing the exact same thing as `cqto`, but for RSI into RCX:RSI. – Peter Cordes Nov 07 '19 at 14:29
  • As per Peter's comment, this is really a 128x128 multiply. Did you read the hints in the book? You should work out the maths of `(2^64*Xh + Xl) * (2^64*Yh + Yl)` to get a result of `2^64*Ph + Pl`. – Jester Nov 07 '19 at 14:32
  • Yup, agreed with @Jester. It looks like you didn't enable full optimization, only `-O1`. At `-O2` and higher, GCC notices (https://godbolt.org/z/Afpa-y) that the inputs are known to only be really 64-bit sign-extended to 128 so it can do the whole thing with a single `imul` instruction. BTW, you can also return `__int128`; the calling convention returns it in RDX:RAX. But making the compiler store the output is a nice way to explicitly see what output it produced on purpose. – Peter Cordes Nov 07 '19 at 14:32
  • You might want to take two `__int128` inputs; you'd get about the same asm but without sign-extension first. – Peter Cordes Nov 07 '19 at 14:35
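
The hint in the comments can be made concrete with a rough C sketch that follows the `-O1` assembly step by step. This is only an illustration, not the book's solution: `store_prod_3mul` and `matches_reference` are hypothetical names, and the code assumes GCC or Clang on x86-64 (`__int128` support, arithmetic right shift of negative values, wrapping conversion to the signed 128-bit type).

```c
#include <stdint.h>

typedef __int128 int128_t;

/* Hypothetical helper mirroring the -O1 assembly.
 * Write x = 2^64*xh + xl and y = 2^64*yh + yl, where xl and yl are the
 * 64-bit patterns read as unsigned and xh, yh are the sign words (0 or -1).
 * Modulo 2^128:  x*y = 2^64*(xh*yl + yh*xl) + xl*yl
 * (the 2^128*xh*yh term falls outside the 128-bit result). */
static void store_prod_3mul(int128_t *dest, int64_t x, int64_t y)
{
    uint64_t xl = (uint64_t)x;
    uint64_t yl = (uint64_t)y;
    uint64_t yh = (uint64_t)(y >> 63);  /* cqto: sign word of y (assumes arithmetic shift) */
    uint64_t xh = (uint64_t)(x >> 63);  /* movq + sarq $63: sign word of x */

    /* The two imulq and the first addq: only the low 64 bits of the
     * cross terms matter, since they are scaled by 2^64. */
    uint64_t cross = xh * yl + yh * xl;

    /* mulq: full 128-bit unsigned product of the low words. */
    unsigned __int128 low = (unsigned __int128)xl * yl;
    uint64_t lo = (uint64_t)low;
    uint64_t hi = (uint64_t)(low >> 64) + cross;  /* addq %rcx, %rdx */

    /* The two stores: low word at dest, high word at dest+8. */
    *dest = (int128_t)(((unsigned __int128)hi << 64) | lo);
}

/* Hypothetical check: compare against the compiler's own 128-bit multiply. */
static int matches_reference(int64_t x, int64_t y)
{
    int128_t got;
    store_prod_3mul(&got, x, y);
    return got == x * (int128_t)y;
}
```

Calling `matches_reference` with a mix of positive, negative, and extreme inputs (for example `INT64_MIN` and `-1`) is a quick way to convince yourself that the three multiplications are enough.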
