I'd like to use the fused multiply-add/subtract CPU instructions to speed up complex multiplication over a fairly large array. The basic math looks like this:
void ComplexMultiplyAddToArray(float* pDstR, float* pDstI, const float* pSrc1R, const float* pSrc1I, const float* pSrc2R, const float* pSrc2I, int len)
{
    for (int i = 0; i < len; ++i)
    {
        const float fSrc1R = pSrc1R[i];
        const float fSrc1I = pSrc1I[i];
        const float fSrc2R = pSrc2R[i];
        const float fSrc2I = pSrc2I[i];

        // Perform complex multiplication on the input and accumulate with the output
        pDstR[i] += fSrc1R*fSrc2R - fSrc1I*fSrc2I;
        pDstI[i] += fSrc1R*fSrc2I + fSrc2R*fSrc1I;
    }
}
As you can see, the real and imaginary parts are stored in separate arrays. Now, suppose I have the following functions available as intrinsics, each mapping to a single instruction, that compute a*b+c and a*b-c respectively:
float fmadd(float a, float b, float c);
float fmsub(float a, float b, float c);
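To make the intended semantics concrete, here is a minimal sketch of portable stand-ins for these two functions. It assumes the standard `std::fma` from `<cmath>` as a substitute for the real intrinsics, which are platform-specific and not shown here. Whether these compile down to single instructions depends on the compiler and target flags, so treat this as an illustration of the math rather than of the actual hardware mapping.

```cpp
#include <cmath>

// Hypothetical stand-ins for the intrinsics described above (assumption:
// the real ones are platform-specific; std::fma is used only to model them).

// Computes a*b + c with a single rounding step.
inline float fmadd(float a, float b, float c)
{
    return std::fma(a, b, c);
}

// Computes a*b - c with a single rounding step.
inline float fmsub(float a, float b, float c)
{
    return std::fma(a, b, -c);
}
```

For example, `fmadd(2.0f, 3.0f, 4.0f)` evaluates 2*3+4 and `fmsub(2.0f, 3.0f, 4.0f)` evaluates 2*3-4.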
Naively, I can see that I can replace two multiplies, one add, and one subtract with one fmadd and one fmsub, like so:
// Perform complex multiplication on the input and accumulate with the output
pDstR[i] += fmsub(fSrc1R, fSrc2R, fSrc1I*fSrc2I);
pDstI[i] += fmadd(fSrc1R, fSrc2I, fSrc2R*fSrc1I);
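For completeness, here is the whole loop with that naive substitution applied, as a self-contained sketch. The function name `ComplexMultiplyAddToArrayFma` is made up for this example, and `fmadd`/`fmsub` are again modeled with `std::fma` as an assumption in place of the real intrinsics. Each line still contains one plain multiply and one plain add (the `+=`), which is exactly the part I'd like to fold away.

```cpp
#include <cmath>

// Hypothetical stand-ins for the intrinsics (modeled with std::fma).
inline float fmadd(float a, float b, float c) { return std::fma(a, b, c); }  // a*b + c
inline float fmsub(float a, float b, float c) { return std::fma(a, b, -c); } // a*b - c

// Same computation as the original loop, with one fused operation per line.
// The name is illustrative only.
void ComplexMultiplyAddToArrayFma(float* pDstR, float* pDstI,
                                  const float* pSrc1R, const float* pSrc1I,
                                  const float* pSrc2R, const float* pSrc2I,
                                  int len)
{
    for (int i = 0; i < len; ++i)
    {
        const float fSrc1R = pSrc1R[i];
        const float fSrc1I = pSrc1I[i];
        const float fSrc2R = pSrc2R[i];
        const float fSrc2I = pSrc2I[i];

        // Real part: Src1R*Src2R - Src1I*Src2I, accumulated into the output
        pDstR[i] += fmsub(fSrc1R, fSrc2R, fSrc1I*fSrc2I);
        // Imaginary part: Src1R*Src2I + Src2R*Src1I, accumulated into the output
        pDstI[i] += fmadd(fSrc1R, fSrc2I, fSrc2R*fSrc1I);
    }
}
```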
This gives a very modest performance improvement and, I assume, slightly better accuracy, but I think I'm missing an algebraic rearrangement that would let me replace a couple more multiply/add or multiply/subtract pairs. Each line still has an extra add and an extra multiply that I feel could be converted into a single fma, but frustratingly, I can't figure out how to do it without changing the order of operations and getting the wrong result. Any math experts with ideas?
For the sake of the question, the target platform probably isn't that important, as I'm aware these kinds of instructions exist on various platforms.