9

Consider the following program:

for i=1 to 10000000 do
  z <- z*z + c

where z and c are complex numbers.

What are efficient x86 assembler implementations of this program using x87 vs SSE and single vs double precision arithmetic?

EDIT: I know I can write this in another language and trust the compiler to generate optimal machine code for me, but I am doing this to learn how to write optimal x86 assembler myself. I have already looked at the code generated by gcc -O2, and my guess is that there is a lot of room for improvement, but I am not adept enough to write optimal x86 assembler by hand, so I am asking for help here.
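
For concreteness, a plain C version of that loop might look like the following (the Complex struct, the value of c and the main harness are illustrative assumptions, not taken from the question); compiling it with gcc -O2 -S is an easy way to inspect the x87 or SSE code the compiler emits:

#include <stdio.h>

// Illustrative double-precision scalar version of the loop above
typedef struct { double re, im; } Complex;

static Complex iterate(Complex z, const Complex c, const int n)
{
    int i;

    for (i = 0; i < n; ++i)
    {
        double re = z.re * z.re - z.im * z.im + c.re;
        double im = 2.0 * z.re * z.im + c.im;
        z.re = re;
        z.im = im;
    }
    return z;
}

int main(void)
{
    const Complex c = { -0.1, 0.1 };   // arbitrary point that stays bounded
    Complex z = { 0.0, 0.0 };

    z = iterate(z, c, 10000000);
    printf("%g %g\n", z.re, z.im);     // keep the result live so the loop isn't optimized away
    return 0;
}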

– J D

3 Answers

7

You don't need to do this in assembler per se - you can use SSE via intrinsics for an efficient implementation, particularly if you can use single precision. Expanding the complex update z = z*z + c into real and imaginary parts gives:

temp.re = z.re * z.re - z.im * z.im;
temp.im = 2.0 * z.re * z.im;
z.re = temp.re + c.re;
z.im = temp.im + c.im;

If you shuffle your input vectors appropriately then you can get all the multiplies in one instruction (_mm_mul_ps) and the adds in a second instruction (_mm_hadd_ps).

If you need double precision then the same general principle applies but you'll need two multiplies and two horizontal adds.

Note that most modern x86 CPUs have two scalar FPUs, so the benefit of SSE for double precision may not be worthwhile; single precision, however, should definitely be a win.


Here's an initial working implementation using SSE. I think it is more or less debugged now, but performance is not much better than scalar code compiled with gcc -O3, as gcc does a pretty good job of generating SSE code for this:

#include <pmmintrin.h>  // SSE3, for _mm_hadd_ps

typedef struct { float re, im; } Complex;  // assumed single-precision complex type

static Complex loop_simd(const Complex z0, const Complex c, const int n)
{
    __m128 vz = _mm_set_ps(z0.im, z0.re, z0.im, z0.re);
    const __m128 vc = _mm_set_ps(c.im, c.re, c.im, c.re); // c in all four lanes so the upper copy of z stays in step
    const __m128 vs = _mm_set_ps(0.0f, 0.0f, -0.0f, 0.0f);
    Complex z[2];
    int i;

    for (i = 0; i < n; ++i)
    {
        __m128 vtemp;

        vtemp = _mm_shuffle_ps(vz, vz, 0x16); // temp = { z.re, z.im, z.im, z.re }
        vtemp = _mm_xor_ps(vtemp, vs);        // temp = { z.re, -z.im, z.im, z.re }
        vtemp = _mm_mul_ps(vtemp, vz);        // temp = { z.re * z.re, - z.im * z.im, z.re * z.im, z.im * z.re }
        vtemp = _mm_hadd_ps(vtemp, vtemp);    // temp = { z.re * z.re - z.im * z.im, 2 * z.re * z.im, ... }
        vz = _mm_add_ps(vtemp, vc);           // temp = { z.re * z.re - z.im * z.im + c.re, 2 * z.re * z.im + c.im, ... }
    }
    _mm_storeu_ps(&z[0].re, vz);
    return z[0];
}

Note that the inner loop is just 6 SSE instructions (it really ought to be 5) + a little housekeeping for the loop itself:

L4:
    movaps  %xmm0, %xmm1
    shufps  $22, %xmm0, %xmm1
    xorps   %xmm3, %xmm1
    mulps   %xmm1, %xmm0
    haddps  %xmm0, %xmm0
    addps   %xmm2, %xmm0
    incl    %eax
    cmpl    %edi, %eax
    jne L4
L2:
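
As noted above, the same shuffle / multiply / horizontal-add pattern carries over to double precision, with one z per register. Here's an untested sketch along those lines (the DComplex type is an illustrative assumption, and it needs SSE3 for haddpd):

#include <pmmintrin.h>  // SSE3, for _mm_hadd_pd

typedef struct { double re, im; } DComplex;  // assumed double-precision complex type

static DComplex loop_simd_pd(const DComplex z0, const DComplex c, const int n)
{
    __m128d vz = _mm_set_pd(z0.im, z0.re);      // vz = { z.re, z.im }
    const __m128d vc = _mm_set_pd(c.im, c.re);  // vc = { c.re, c.im }
    const __m128d vs = _mm_set_pd(-0.0, 0.0);   // sign mask for the high lane
    DComplex z;
    int i;

    for (i = 0; i < n; ++i)
    {
        __m128d vswap  = _mm_shuffle_pd(vz, vz, 1);           // { z.im, z.re }
        __m128d vsq    = _mm_xor_pd(_mm_mul_pd(vz, vz), vs);  // { z.re * z.re, -z.im * z.im }
        __m128d vcross = _mm_mul_pd(vz, vswap);               // { z.re * z.im, z.im * z.re }
        vz = _mm_add_pd(_mm_hadd_pd(vsq, vcross), vc);        // { re*re - im*im + c.re, 2 * re * im + c.im }
    }
    _mm_storeu_pd(&z.re, vz);
    return z;
}

That's still 6 SSE instructions per iteration, so as noted above it may not be much faster than decent scalar code for double precision.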
– Paul R
  • As I said above, it doesn't need to be raw asm - you can just use intrinsics in C/C++ code. – Paul R Apr 26 '12 at 08:53
  • Then I'd like to see working C/C++ code using intrinsics that I can compile in order to decipher the assembler. :-) – J D Apr 26 '12 at 09:00
  • OK - I just tried this with gcc and single precision SSE and was surprised to see that gcc could almost match my SSE implementation (about 20% difference) - so I guess the bottom line is: let the compiler do the heavy lifting. – Paul R Apr 26 '12 at 09:40
  • The problem with the shuffling of SSE registers is that it tends to eat up most of the performance gain you win from the parallelism. But it's always worth a try. – Gunther Piez Apr 26 '12 at 10:31
  • @drhirsch: well there is only 1 shuffle per iteration, and it uses a different execution unit than the floating point operations, so it can potentially have zero cost. – Paul R Apr 26 '12 at 14:07
  • @PaulR This is exactly the kind of thing I was after. Thanks! – J D Apr 26 '12 at 15:07
  • @Jon: check out Andreas's solution also - it's more efficient **iff** you have 4 *independent* Z's that you can calculate in parallel. – Paul R Apr 26 '12 at 15:09
  • @PaulR Yes. That is of less interest to me than solving this problem. – J D Apr 27 '12 at 13:09
6

Look at the disassembly from your favorite compiler. If you're looking to perform this computation for several values of z and c (like calculating a Mandelbrot image), I suggest you work on four values at once and put them in SSE registers. Starting from the code in Paul R's answer, you could do all of these calculations for four values at once:

__m128 z_im, z_re, c_im, c_re; //Four z and c values packed
__m128 re = _mm_sub_ps(_mm_mul_ps(z_re, z_re), _mm_mul_ps(z_im, z_im));
__m128 im = _mm_mul_ps(z_re, z_im);
im = _mm_add_ps(im, im); // Multiply by two
z_re = _mm_add_ps(re, c_re);
z_im = _mm_add_ps(im, c_im);
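
To make this concrete, here is one way that update might be wrapped in a loop over four independent (z, c) pairs, with the kind of early-exit test mentioned in the comments below (the function name, the array-based interface and the 4.0f escape threshold are illustrative assumptions):

#include <xmmintrin.h>  // SSE

// Illustrative sketch: iterate four independent (z, c) pairs in parallel,
// stopping once all four have escaped |z|^2 > 4
static void iterate4(float z_re4[4], float z_im4[4],
                     const float c_re4[4], const float c_im4[4], const int n)
{
    __m128 z_re = _mm_loadu_ps(z_re4);
    __m128 z_im = _mm_loadu_ps(z_im4);
    const __m128 c_re = _mm_loadu_ps(c_re4);
    const __m128 c_im = _mm_loadu_ps(c_im4);
    const __m128 four = _mm_set1_ps(4.0f);
    int i;

    for (i = 0; i < n; ++i)
    {
        __m128 re = _mm_sub_ps(_mm_mul_ps(z_re, z_re), _mm_mul_ps(z_im, z_im));
        __m128 im = _mm_mul_ps(z_re, z_im);
        im = _mm_add_ps(im, im);                // multiply by two
        z_re = _mm_add_ps(re, c_re);
        z_im = _mm_add_ps(im, c_im);

        // |z|^2 = re^2 + im^2; stop when every lane has escaped
        __m128 mag2 = _mm_add_ps(_mm_mul_ps(z_re, z_re), _mm_mul_ps(z_im, z_im));
        if (_mm_movemask_ps(_mm_cmpgt_ps(mag2, four)) == 0xF)
            break;
    }
    _mm_storeu_ps(z_re4, z_re);
    _mm_storeu_ps(z_im4, z_im);
}

In practice the escape test adds several instructions per iteration, so you might only check it every few iterations; per-lane iteration counts also need extra masking if you want a proper Mandelbrot escape-time image.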
– Andreas Brinck
  • +1: good idea, but the compiler probably won't use SSE, so it will probably not be optimal – Paul R Apr 26 '12 at 08:44
  • I have already done this, of course. The output from `gcc -O2` is interesting (lots of `fxch` which surprised me because I didn't know it was "free") but no SSE. I'm looking for more efficient code. – J D Apr 26 '12 at 08:51
  • Try out the Intel compiler, last time I checked (five years ago or something) it was very good at utilising SSE. – Andreas Brinck Apr 26 '12 at 08:52
  • Note that there is a serial dependency in the loop so I doubt that ICC or any other vectorizing compiler will generate SSE code for this. – Paul R Apr 26 '12 at 08:55
  • @PaulR There is a serial dependency between z_n and z_n+1 but there is some potential for instruction-level parallelism within each iteration. – J D Apr 26 '12 at 09:02
  • @AndreasBrinck Is that C code with SSE intrinsics for MS VC++? – J D Apr 26 '12 at 09:05
  • @Jon Harrop Yes, but they map directly to SSE assembly instructions. – Andreas Brinck Apr 26 '12 at 09:12
  • @Jon: yes, I covered the parallelism that *can* be exploited per iteration in my answer, but my point was that a compiler will probably not be smart enough to do this and it can't do simple vectorization because of the serial dependency. – Paul R Apr 26 '12 at 09:12
  • There's a lot of redundancy in the above code - you really don't need all those multiplies. – Paul R Apr 26 '12 at 09:43
  • From what I can tell there's no redundancy at all if you want to do four calculations *at once*. – Andreas Brinck Apr 26 '12 at 10:34
  • @Andreas: you have 4 multiplies, a subtract and 2 adds, for a total of 7 SSE arithmetic instructions - see the implementation in my answer which uses 3 arithmetic instructions, only one of which is a multiply, plus a bitwise operation and a shuffle, for a total of 5. – Paul R Apr 26 '12 at 12:50
  • @Paul R But you're only calculating the value for a single `z`. So I have 7 / 4 = 1.75 instructions for each computation of `z`. – Andreas Brinck Apr 26 '12 at 13:56
  • @Andreas: oh yes - I see now - unfortunately that will only work if you have 4 *independent* Zs to calculate (and the same number of iterations for each), but maybe that is the case? – Paul R Apr 26 '12 at 14:04
  • @Paul R Yes, I know. But even if the `z` are not completely independent (you may want to break out of the loop at a certain value) you can still calculate them in parallel while all `z` are below a certain value or something. I used the same approach in a raytracer which calculated four rays in parallel. – Andreas Brinck Apr 26 '12 at 14:24
  • @AndreasBrinck "they map directly to SSE assembly instructions". How is that possible when your program does not map directly to SSE registers? Presumably the compiler injects loads and stores when it needs to spill? – J D Apr 26 '12 at 15:08
  • @Jon Harrop Yes, that was an oversimplification on my part. – Andreas Brinck Apr 26 '12 at 15:10
  • Note that there are 8 SSE registers in 32 bit mode (and 16 SSE registers in 64 bit mode), so I doubt for the above example whether there would be any register spill. – Paul R Apr 26 '12 at 15:15
  • @PaulR Why are there more for 64-bit? – J D Apr 27 '12 at 13:02
  • @Jon: that's just the way that they designed the x86-64 architecture - I guess you don't need much silicon to add a few registers when you get the chance and it can make a big difference to performance. (I typically see anything up to 30% improvement for SSE code on x86-64 compared with x86). – Paul R Apr 27 '12 at 14:32
3

Z = Z*Z + C

That is the Mandelbrot fractal iteration.

I'm sure you'll find highly optimized code for this all over the net. I would start with the source code of Xaos and Fractint.

– Nils Pipenbrinck