
I am trying to convert C++ code into x87 style inline assembly code.

C++ code:

  double a = 0.0, b = 0.0, norm2 = 0.0;
  int n;
  for (n = 0; norm2 < 4.0 && n < N; ++n) {
    double c = a*a - b*b + x;
    b = 2.0*a*b + y;
    a = c;
    norm2 = a*a + b*b;
  }

Inline assembly code:

  double a = 0.0, b = 0.0, norm2 = 0.0;
  int n;
  for (n = 0; norm2 < 4.0 && n < N; ++n) { 
    // double c = a * a - b * b + x;
    __asm fld a 
    __asm fmul st(0), st(0) 
    __asm fld b 
    __asm fmul st(0), st(0) 
    __asm fsubp st(1), st(0) 
    __asm fld x 
    __asm faddp st(1), st(0) 
    __asm fstp c 

    // b = 2.0 * a * b + y;
    __asm fld two 
    __asm fld b 
    __asm fld a 
    __asm fmulp st(2), st(0) 
    __asm fmulp st(1), st(0) 
    __asm fld y
    __asm faddp st(1), st(0) 
    __asm fstp b

    // a = c
    __asm fld c
    __asm fstp a
    
    //norm2 = a * a + b * b;
    __asm fld a 
    __asm fmul st(0), st(0) 
    __asm fld b 
    __asm fmul st(0), st(0) 
    __asm faddp st(1), st(0) 
    __asm fstp norm2
  
  }

While my assembly code works, it is very slow. How can I speed it up?

user3702643
  • Find a compiler that supports your platform and get it to generate optimized code, then compare the two. – Richard Critten May 28 '21 at 12:18
  • This style of inline assembly is known to be inefficient due to a lot of reloads being needed. Try to write assembly functions entirely in assembly instead of using inline assembly. – fuz May 28 '21 at 12:20
  • @RichardCritten I tried using godbolt but I don't really understand it. I can't copy and paste and run the output from godbolt – user3702643 May 28 '21 at 12:28
  • Better question, is the original code slow when compiled with an optimizing compiler? Do you need assembly? Especially x87? Is this the bottleneck in your application? – Jester May 28 '21 at 12:31
  • Yes I need assembly and yes x87. It is part of the project specs. The original code is faster than my assembly code right now. I would like it to at least be the same speed – user3702643 May 28 '21 at 12:34
  • Which of the compilers on godbolt is the closest to your platform ? – Richard Critten May 28 '21 at 12:37
  • @RichardCritten https://gcc.godbolt.org/z/6WqcrjYaj this one. but I tried copying and pasting it but I cant get it to compile. – user3702643 May 28 '21 at 12:39
  • It just seems counterproductive to ask a compiler to generate code that you then copy into inline assembly instead of letting the compiler do its job as usual. – Jester May 28 '21 at 12:42
  • @user3702643 gcc does not support MSVC-style inline assembly, so it definitely isn't the one you are programming for. – fuz May 28 '21 at 12:44
  • @fuz I'm sorry about that. I'm not 100% sure on godbolt. However I am using msvc to compile – user3702643 May 28 '21 at 12:47
  • Visual Studio 2019 Developer Command Prompt v16.9.5 is what I am using to compile @fuz – user3702643 May 28 '21 at 12:49
  • @user3702643 Yes, the compiler is MSVC. It's kind of a garbage compiler. Consider using a better one like gcc, clang or icc. – fuz May 28 '21 at 12:50
  • @fuz the project specs require me to compile with cl /W3 /EHsc mandelbrot.cpp render_point.cpp user32.lib gdi32.lib. The cl command means I have to use MSVC right? – user3702643 May 28 '21 at 12:51
  • @user3702643 Yes, the `cl` program is for MSVC. Very unfortunate. – fuz May 28 '21 at 13:29
  • MSVC on Godbolt can make legacy x87 asm with `/arch:IA32` https://gcc.godbolt.org/z/KaaP6q3vs (https://learn.microsoft.com/en-us/cpp/build/reference/arch-x86?redirectedfrom=MSDN&view=msvc-160). I didn't notice any x87 syntax differences for the loop between GCC's Intel-syntax vs. MSVC; of course variable access needs to use just the C name, not an addressing mode. – Peter Cordes May 28 '21 at 18:42
  • AFAIK legacy x87 instructions tend to be less efficient on modern machines, whereas the compiler would be able to use SSE and other more recent extensions. You're trying to beat the compiler with one hand tied behind your back. It's an absurd project requirement unless you are deliberately retrocomputing or something. – Nate Eldredge May 28 '21 at 20:30
  • @NateEldredge They aren't really slower, but I believe they may have less execution units to run on. – fuz Jun 04 '21 at 09:12

1 Answer


There's a lot to improve with this one. Here are some points to start with:

Do not program in MSVC-style inline assembly

MSVC-style inline assembly may be easy to program in, but it also forces all variables to live in memory. Every time you read from or assign to one of your variables, a slow memory access is performed. This hurts performance quite a bit.

Instead, write the whole function in assembly in a separate assembly file. If that isn't possible, at least start your assembly code with loading all variables into registers, then compute entirely on these registers and end the assembly section by writing the registers back to variables. This way, the amount of useless data movements is minimised.

When you do this, implement the for loop itself in assembly so you don't have to write out and then read back in all variables each iteration, but rather only once for the whole loop.
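A rough sketch of what "the whole loop in one asm block" might look like follows. It is untested and rests on several assumptions: a 32-bit MSVC build (the `_M_IX86` guard), an empty x87 stack on entry, and the hypothetical names `mandel_iterations` and `four`, none of which come from the question. It keeps a, b, a*a and b*b on the x87 stack for the whole loop and only writes `n` back at the end. Other compilers take the plain C++ branch, which is the loop from the question.

```cpp
#include <cassert>

// Hypothetical wrapper: iteration count for the point (x, y), at most N.
int mandel_iterations(double x, double y, int N) {
    assert(N >= 0);
    int n = 0;
#if defined(_MSC_VER) && defined(_M_IX86)
    double four = 4.0;              // memory operand for fcomp
    __asm {
        xor    edx, edx             // edx = n
        fldz                        // b
        fldz                        // a
        fldz                        // bb = b*b
        fldz                        // aa = a*a; stack is now: aa bb a b
    iterate:
        cmp    edx, N
        jge    done                 // leave when n >= N
        fld    st(0)                // aa aa bb a b
        fadd   st(0), st(2)         // norm2 = aa + bb
        fcomp  four                 // compare norm2 with 4.0 and pop
        fnstsw ax
        sahf                        // C0 -> CF, C2 -> PF, C3 -> ZF
        jae    done                 // norm2 >= 4.0
        jp     done                 // unordered (NaN)
        fsubrp st(1), st(0)         // aa - bb; stack: (aa-bb) a b
        fadd   x                    // c = aa - bb + x; stack: c a b
        fxch   st(1)                // a c b
        fadd   st(0), st(0)         // 2a c b
        fmulp  st(2), st(0)         // c (2a*b)
        fxch   st(1)                // (2a*b) c
        fadd   y                    // new b on top, new a below
        fxch   st(1)                // a b
        fld    st(1)                // b a b
        fmul   st(0), st(0)         // bb a b
        fld    st(1)                // a bb a b
        fmul   st(0), st(0)         // aa bb a b
        inc    edx                  // ++n
        jmp    iterate
    done:
        fstp   st(0)                // pop the four live values
        fstp   st(0)
        fstp   st(0)
        fstp   st(0)
        mov    n, edx
    }
#else
    // Portable fallback: the original C++ loop from the question.
    double a = 0.0, b = 0.0, norm2 = 0.0;
    for (n = 0; norm2 < 4.0 && n < N; ++n) {
        double c = a * a - b * b + x;
        b = 2.0 * a * b + y;
        a = c;
        norm2 = a * a + b * b;
    }
#endif
    return n;
}
```

The only memory operands left inside the loop are `x`, `y`, `N` and the constant; everything else stays on the register stack. The `fcomp`/`fnstsw`/`sahf` sequence is used for the `norm2 < 4.0` test because it does not need Pentium Pro instructions.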

Keep as many values in registers as possible

As said before, all these fld and fstp instructions take time. Keep the numbers in registers so you don't have to constantly reload them. If that isn't possible, at least fold the loads into the arithmetic instructions that use them by giving those instructions a memory operand. For example, instead of

__asm fld x 
__asm faddp st(1), st(0) 

you could do

__asm fadd x

But it is much better to just keep everything in registers. For example, you could easily get rid of the c variable by just keeping it on the stack.

Do not perform work twice

Your code computes a*a and b*b twice: once in the previous iteration to compute norm2 and once in the next iteration to compute c. Compute these products once and keep them around to save you two multiplications.
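To make the data flow concrete before translating it to x87, here is a plain C++ sketch of the same restructuring. The function names `iterations_reference` and `iterations_cached` are made up for illustration. The cached version carries a*a and b*b from one iteration to the next, so each product is computed once per iteration instead of twice.

```cpp
#include <cassert>

// The loop exactly as written in the question.
int iterations_reference(double x, double y, int N) {
    assert(N >= 0);
    double a = 0.0, b = 0.0, norm2 = 0.0;
    int n;
    for (n = 0; norm2 < 4.0 && n < N; ++n) {
        double c = a * a - b * b + x;
        b = 2.0 * a * b + y;
        a = c;
        norm2 = a * a + b * b;
    }
    return n;
}

// Same loop, but the squares are computed once and reused.
int iterations_cached(double x, double y, int N) {
    assert(N >= 0);
    double a = 0.0, b = 0.0;
    double aa = 0.0, bb = 0.0;      // a*a and b*b from the previous step
    int n;
    for (n = 0; aa + bb < 4.0 && n < N; ++n) {
        double c = aa - bb + x;     // reuses the cached squares
        b = 2.0 * a * b + y;
        a = c;
        aa = a * a;                 // computed once, used for the norm
        bb = b * b;                 // test and for the next iteration
    }
    return n;
}
```

In the assembly version, `aa` and `bb` would simply be two more values kept on the x87 stack.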

Use cheaper instructions instead of more expensive ones

Recall that 2x = x + x and replace an expensive load of a constant and a multiplication with an addition.

Also recall that a² - b² = (a + b)(a - b) to replace a multiplication with an addition. Note that this may change the rounding and is incompatible with the “do not perform work twice” advice. But perhaps it may be used for the initial iteration.
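A small C++ sketch of both identities, with illustrative helper names that are not from the question. Doubling by addition gives the same result as multiplying by 2.0, while the factored difference of squares can differ from the direct form in the last bits.

```cpp
#include <cassert>
#include <cmath>

// 2x computed as x + x: one addition, no constant load, no multiply.
inline double twice(double v) { return v + v; }

// a*a - b*b as written in the question: two multiplies, one subtract.
inline double diff_sq_direct(double a, double b) { return a * a - b * b; }

// Factored form: one multiply, one add, one subtract.
inline double diff_sq_factored(double a, double b) { return (a + b) * (a - b); }

// Small self-check of the two identities.
inline void check_identities() {
    // Doubling is exact in binary floating point (barring overflow).
    assert(twice(0.1) == 2.0 * 0.1);
    // The factored form agrees only up to rounding in general.
    assert(std::fabs(diff_sq_factored(0.1, 0.3) - diff_sq_direct(0.1, 0.3)) < 1e-12);
}
```

In x87 terms, `twice` corresponds to `fadd st(0), st(0)` on a value that is already on the stack.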

fuz
  • Thank you this is very helpful. Do you have tips on how to write the loop in assembly? I am struggling with that – user3702643 May 28 '21 at 12:53
  • @user3702643 What part specifically is it you are struggling with? The loop is just a label at the beginning of it and a conditional jump at the end that goes back to the beginning if the condition still holds. – fuz May 28 '21 at 13:28
  • the condition part of the loop. how do i do norm2 < 4.0 && n < N ? @fuz – user3702643 May 28 '21 at 13:37
  • @user3702643 For `n < N` use a `cmp` followed by a `jl`. For `norm2 < 4.0` it depends on whether you are allowed to use instructions introduced with the Pentium Pro or not. Are you allowed to do that? Or does the code have to run on older processors, too? – fuz May 28 '21 at 13:40
  • @fuz: to be fair, you could write the whole loop in one MSVC inline asm block. Then the store/reload round trip(s) is/are outside the loop. That's at least as good as a function call passing pointer args. – Peter Cordes May 28 '21 at 18:31
  • @PeterCordes I do in fact recommend this approach in the answer. – fuz May 28 '21 at 19:03
  • Your answer says to use a stand-alone asm file, and *when you do this*, to put the for loop inside it. Under a headline-sized "Do not program in MSVC-style inline assembly". :P So no, your answer doesn't say that you could get similar results from MSVC inline asm. – Peter Cordes May 28 '21 at 19:05
  • There's a little sentence starting with "if that isn't possible..." introducing the idea with the whole loop and such. – fuz May 28 '21 at 19:18