3

I'd like to leverage available fused multiply-add/subtract CPU instructions to assist in complex multiplication over a decently sized array. Essentially, the basic math looks like so:

void ComplexMultiplyAddToArray(float* pDstR, float* pDstI, const float* pSrc1R, const float* pSrc1I, const float* pSrc2R, const float* pSrc2I, int len)
{
    for (int i = 0; i < len; ++i)
    {
        const float fSrc1R = pSrc1R[i];
        const float fSrc1I = pSrc1I[i];
        const float fSrc2R = pSrc2R[i];
        const float fSrc2I = pSrc2I[i];

        //  Perform complex multiplication on the input and accumulate with the output
        pDstR[i] += fSrc1R*fSrc2R - fSrc1I*fSrc2I;
        pDstI[i] += fSrc1R*fSrc2I + fSrc2R*fSrc1I;
    }
}

As you can probably see, the data is structured so that we have separate arrays of real numbers and imaginary numbers. Now, suppose I have the following functions available as intrinsics that map to single instructions computing a*b+c and a*b-c respectively:

float fmadd(float a, float b, float c);
float fmsub(float a, float b, float c);

Naively, I can see that I can replace 2 multiplies, one add, and one subtract with one fmadd and one fmsub, like so:

//  Perform complex multiplication on the input and accumulate with the output
pDstR[i] += fmsub(fSrc1R, fSrc2R, fSrc1I*fSrc2I);
pDstI[i] += fmadd(fSrc1R, fSrc2I, fSrc2R*fSrc1I);

This results in very modest performance improvements, along with, I assume, better accuracy, but I think I'm really missing something: the math can probably be rearranged algebraically so that I can replace a couple more multiply/add or multiply/subtract combinations. In each line there's still an extra add and an extra multiply that I feel should collapse into a single FMA, but frustratingly, I can't figure out how to do it without changing the order of operations and getting the wrong result. Any math experts with ideas?

For the sake of the question, the target platform probably isn't that important, as I'm aware these kinds of instructions exist on various platforms.

Kumputer
  • Actually you are not reducing multiplies because `fSrc1I*fSrc2I` is still there. – user3528438 May 07 '15 at 00:46
  • Yes, but the preceding mult/sub has been replaced with a single fmsub, so it does speed things up very slightly. – Kumputer May 07 '15 at 00:51
  • In one iteration, you are loading 6 `float`s, using multiplier 4 times, and using adder twice, writing 2 `floats`. By hiding addition into multiplication you are saving time on the adder, which isn't likely to be the bottleneck in this case. First thing I would try is to `restrict` those pointers to allow more aggressive scheduling of load/store instructions. After that, memory bandwidth becomes more of a concern than computation. – user3528438 May 07 '15 at 01:01
  • I forgot to mention that `restrict` is a C99 keyword. Although most compilers support their own equivalents in C++ mode, I still suggest compiling this function in C99 mode and using standard `restrict`. – user3528438 May 07 '15 at 01:21

2 Answers

4

That is a good start. You can eliminate one more addition:

//  Perform complex multiplication on the input and accumulate with the output
pDstR[i] += fmsub(fSrc1R, fSrc2R, fSrc1I*fSrc2I);
pDstI[i] += fmadd(fSrc1R, fSrc2I, fSrc2R*fSrc1I);

Here you can make use of another fmadd in the calculation of the imaginary part:

pDstI[i] = fmadd(fSrc1R, fSrc2I, fmadd(fSrc2R, fSrc1I, pDstI[i]));

Likewise you can do the same with the real part, but you need to negate the accumulator argument. Whether this makes things faster or slower depends a lot on the micro-timing of the architecture you're working on:

pDstR[i] = fmsub(fSrc1R, fSrc2R, fmadd(fSrc1I, fSrc2I, -pDstR[i]));

Btw, you may get further performance improvements if you declare your destination arrays as non-aliasing using the `restrict` keyword. Right now the compiler must assume that `pDstR` and `pDstI` may overlap or point to the same chunk of memory. That prevents the compiler from loading `pDstI[i]` before doing the write to `pDstR[i]`.

Afterwards some careful loop-unrolling may also help if the compiler isn't already doing that. Check the assembler output of your compiler!

Nils Pipenbrinck
  • Thanks for that. I actually was able to figure out the second fma on the imaginary part independently. Modifying just that line resulted in a 12% speedup using 8-wide __m256 AVX based loops using Intel FMA3. I then tried using your suggestion of negating the real destination by using xor after loading it, but then it slowed back down again. Thinking that this was due to poor pipelining of the odd xor, I tried just not negating it and accepting the wrong result, but the performance was still the same as before. Ultimately, it looks like just using the 2nd fma on the imaginary part wins. – Kumputer May 07 '15 at 01:58
  • Also, adding the restrict keyword for the destination pointers actually reduces the performance on the msvc11 compiler by 20%. Here's the asm output without restrict: – Kumputer May 07 '15 at 02:03
  • (asm without `restrict`, one instruction per line:)

        vmovups     ymm4,ymmword ptr [rax]
        vmovups     ymm5,ymmword ptr [rax+r12]
        vmovups     ymm0,ymmword ptr [rax+r15]
        lea         rax,[rax+20h]
        vmulps      ymm1,ymm4,ymmword ptr [rax+r9-20h]
        vfmsub231ps ymm1,ymm5,ymmword ptr [rax+r13-20h]
        vfmadd231ps ymm0,ymm4,ymmword ptr [rax+r13-20h]
        vaddps      ymm1,ymm1,ymmword ptr [rax+r14-20h]
        vfmadd231ps ymm0,ymm5,ymmword ptr [rax+r9-20h]
        vmovups     ymmword ptr [rax+r14-20h],ymm1
        vmovups     ymmword ptr [rax+r15-20h],ymm0

    – Kumputer May 07 '15 at 02:05
  • and with `restrict`:

        vmovups     ymm1,ymmword ptr [rax]
        vmovups     ymm3,ymmword ptr [rax+r9]
        vmovups     ymm4,ymmword ptr [rax+r12]
        vmovups     ymm2,ymmword ptr [rax+r15]
        lea         rax,[rax+20h]
        vfmadd231ps ymm2,ymm1,ymmword ptr [rax+r13-20h]
        vmulps      ymm1,ymm1,ymm3
        vfmadd231ps ymm2,ymm4,ymm3
        vfmsub231ps ymm1,ymm4,ymmword ptr [rax+r13-20h]
        vaddps      ymm1,ymm1,ymmword ptr [rax+r14-20h]
        vmovups     ymmword ptr [rax+r14-20h],ymm1
        vmovups     ymmword ptr [rax+r15-20h],ymm2

    – Kumputer May 07 '15 at 02:07
  • I'm not sure how to format comments here as code blocks, but the assembly is very similar: aside from the instructions being reordered slightly, the `restrict` version uses an extra register after a load from memory, where the version without `restrict` just does a `vmulps` with the same pointer as the third operand. – Kumputer May 07 '15 at 02:11
  • @Kumputer put your code into `backticks` or it'll be much difficult to see/read – phuclv May 07 '15 at 03:40
3

I've found that the following (with a little help) seems to produce the correct answer:

pDstR[i] = fmsub(fSrc1R, fSrc2R, fmsub(fSrc1I, fSrc2I, pDstR[i]));
pDstI[i] = fmadd(fSrc1R, fSrc2I, fmadd(fSrc2R, fSrc1I, pDstI[i]));

But oddly, it doesn't improve performance on AVX as much as leaving the real part half-fused while fully fusing the imaginary part:

pDstR[i] += fmsub(fSrc1R, fSrc2R, fSrc1I*fSrc2I);
pDstI[i] = fmadd(fSrc1R, fSrc2I, fmadd(fSrc2R, fSrc1I, pDstI[i]));

Thanks everyone for the help.

Kumputer