You don't need to do this in assembler per se - you can use SSE via intrinsics for an efficient implementation, particularly if you can use single precision.
temp.re = z.re * z.re - z.im * z.im;
temp.im = 2.0 * z.re * z.im;
z.re = temp.re + c.re;
z.im = temp.im + c.im;
If you shuffle your input vectors appropriately then you can get all the multiplies in one instruction (_mm_mul_ps
) and the adds in a second instruction (_mm_hadd_ps
).
If you need double precision then the same general principle applies but you'll need two multiplies and two horizontal adds.
Note that most modern x86 CPUs have two scalar FPUs so the benefit for double precision in SSE may not be worthwhile - single precision however should definitely be a win.
Here's an initial working implementation using SSE - I think it is more or less debugged now - performance is not much better than scalar code compiled with gcc -O3 though, as gcc does a pretty good job of generating SSE code for this:
static Complex loop_simd(const Complex z0, const Complex c, const int n)
{
__m128 vz = _mm_set_ps(z0.im, z0.re, z0.im, z0.re);
const __m128 vc = _mm_set_ps(0.0f, 0.0f, c.im, c.re);
const __m128 vs = _mm_set_ps(0.0f, 0.0f, -0.0f, 0.0f);
Complex z[2];
int i;
for (i = 0; i < n; ++i)
{
__m128 vtemp;
vtemp = _mm_shuffle_ps(vz, vz, 0x16); // temp = { z.re, z.im, z.im, z.re }
vtemp = _mm_xor_ps(vtemp, vs); // temp = { z.re, -z.im, z.im, z.re }
vtemp = _mm_mul_ps(vtemp, vz); // temp = { z.re * z.re, - z.im * z.im, z.re * z.im, z.im * z.re }
vtemp = _mm_hadd_ps(vtemp, vtemp); // temp = { z.re * z.re - z.im * z.im, 2 * z.re * z.im, ... }
vz = _mm_add_ps(vtemp, vc); // temp = { z.re * z.re - z.im * z.im + c.re, 2 * z.re * z.im + c.im, ... }
}
_mm_storeu_ps(&z[0].re, vz);
return z[0];
}
Note that the inner loop is just 6 SSE instructions (it really ought to be 5) + a little housekeeping for the loop itself:
L4:
movaps %xmm0, %xmm1
shufps $22, %xmm0, %xmm1
xorps %xmm3, %xmm1
mulps %xmm1, %xmm0
haddps %xmm0, %xmm0
addps %xmm2, %xmm0
incl %eax
cmpl %edi, %eax
jne L4
L2: