I have to accomplish the following using MIC's 512-bit vector units:

M->|b4|a4|b3|a3|b2|a2|b1|a1|
I->|d4|c4|d3|c3|d2|c2|d1|c1|

O-> O + |a4d4+b4c4|a4c4-b4d4|a3d3+b3c3|a3c3-b3d3|a2d2+b2c2|a2c2-b2d2|a1d1+b1c1|a1c1-b1d1|
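
To make the target concrete, here is a plain scalar sketch of what the diagrams above describe. It is only a reference model, not MIC code: the function name `cmul_acc_ref` and the array layout are my own assumptions (index 0 is the lowest lane, a1/c1, and index 7 the highest, b4/d4).

```c
#include <stddef.h>

/* Scalar reference for the update in the diagrams.
 * Assumed layout: index 0 = lowest lane (a1 / c1), index 7 = highest (b4 / d4). */
void cmul_acc_ref(const double M[8], const double I[8], double O[8])
{
    for (size_t k = 0; k < 4; ++k) {
        double a = M[2 * k], b = M[2 * k + 1];
        double c = I[2 * k], d = I[2 * k + 1];
        O[2 * k]     += a * c - b * d;  /* low lane of the pair:  a*c - b*d */
        O[2 * k + 1] += a * d + b * c;  /* high lane of the pair: a*d + b*c */
    }
}
```

Read this way, each pair of lanes is one complex multiply-accumulate, with the real part in the low lane and the imaginary part in the high lane.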

The method I thought of is similar to what Intel proposed for SSE, which also works with AVX:

Using the _mm512_swizzle_pd() functions to form:

m0    = |a4|a4|a3|a3|a2|a2|a1|a1|
m0_t  = |b4|b4|b3|b3|b2|b2|b1|b1|
in0   = |d4|c4|d3|c3|d2|c2|d1|c1|
in0_r = |c4|d4|c3|d3|c2|d2|c1|d1|

Then multiply m0 by in0 and m0_t by in0_r, and combine the two products with something similar to addsub_pd() for MIC. But there doesn't seem to be a corresponding intrinsic.
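
As a sanity check on the recipe, here is a scalar sketch of the duplicate, swap, multiply and add/subtract steps. It is a model only and uses no intrinsics. The name `cmul_acc_addsub` is hypothetical, index 0 is assumed to be the lowest lane, and the add/subtract step is assumed to subtract in the low lane of each pair and add in the high lane, as SSE3's addsub_pd does.

```c
/* Scalar model of the recipe: duplicate a and b, swap c and d,
 * multiply, then subtract in even lanes and add in odd lanes.
 * Assumed layout: index 0 = lowest lane. */
void cmul_acc_addsub(const double M[8], const double I[8], double O[8])
{
    double m0[8], m0_t[8], in0_r[8];

    for (int k = 0; k < 8; k += 2) {
        m0[k]    = m0[k + 1]   = M[k];      /* |a|a| */
        m0_t[k]  = m0_t[k + 1] = M[k + 1];  /* |b|b| */
        in0_r[k]     = I[k + 1];            /* swapped pair: */
        in0_r[k + 1] = I[k];                /* |c|d|         */
    }

    for (int k = 0; k < 8; ++k) {
        double p = m0[k] * I[k];          /* a*c (even lane) or a*d (odd lane) */
        double q = m0_t[k] * in0_r[k];    /* b*d (even lane) or b*c (odd lane) */
        O[k] += (k % 2 == 0) ? p - q : p + q;  /* the addsub step */
    }
}
```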

Any suggestions on how I can achieve this?

Intel's MIC (Xeon Phi) also has several FMA intrinsics, such as fmadd, fmsub, fnmadd, and fnmsub, which should lend themselves to this situation. I have the following two approaches:

'O' is the output register
Approach 1:
1. _mm512_fmadd_pd(m0,in0,O);
2. Explicitly set m0_t using _mm512_set_pd() to make it: |b4|-b4|b3|-b3|b2|-b2|b1|-b1|
3. _mm512_fmadd_pd(m0_t,in0_r,O);
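
Approach 1 can be checked lane by lane with a scalar sketch. This is a model, not intrinsic code: `fmadd8` and `cmul_acc_two_fma` are hypothetical names, index 0 is assumed to be the lowest lane, and each FMA result is assumed to be written back into O.

```c
/* Lane-wise model of an unmasked fused multiply-add: out = v1*v2 + v3. */
static void fmadd8(const double v1[8], const double v2[8],
                   const double v3[8], double out[8])
{
    for (int k = 0; k < 8; ++k)
        out[k] = v1[k] * v2[k] + v3[k];
}

/* Approach 1: two FMAs, with the sign folded into the b vector. */
void cmul_acc_two_fma(const double M[8], const double I[8], double O[8])
{
    double m0[8], m0_t_signed[8], in0_r[8];

    for (int k = 0; k < 8; k += 2) {
        m0[k] = m0[k + 1]  = M[k];       /* |a|a|  */
        m0_t_signed[k]     = -M[k + 1];  /* low lane:  -b */
        m0_t_signed[k + 1] =  M[k + 1];  /* high lane: +b */
        in0_r[k]     = I[k + 1];         /* swapped pair |c|d| */
        in0_r[k + 1] = I[k];
    }

    fmadd8(m0, I, O, O);                 /* step 1: O += m0 * in0 */
    fmadd8(m0_t_signed, in0_r, O, O);    /* step 3: O += (+/-b) * in0_r */
}
```

The low lane of each pair ends up with a*c - b*d and the high lane with a*d + b*c, at the cost of building the signed copy of the b values.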

Approach 2:
1. _mm512_fmadd_pd(m0,in0,O);
2. _mm512_mask_fmadd_pd(m0_t,k1,in0_r,O); with k1=10101010
3. _mm512_mask_fnmadd_pd(m0_t,k2,in0_r,O); with k2=01010101

Is there a better approach? Any faults with these approaches?

sssylvester
user1715122

1 Answer

tmp = _mm512_mul_pd(m0_t, in0_r);
tmp = _mm512_mask3_fmadd_pd(m0, in0, tmp, k1); // with k1 = 10101010
res = _mm512_mask3_fmsub_pd(m0, in0, tmp, k2); // with k2 = 01010101
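
The three lines above can be traced with a scalar sketch. It is a model under stated assumptions, not intrinsic code: bit i of the mask is taken to control lane i (index 0 = lowest lane), and the mask3 forms are assumed to leave lanes with a clear mask bit equal to the third operand. The helpers `mask3_fmadd8` and `mask3_fmsub8` and the function `cmul_mask3` are hypothetical names.

```c
#include <stdint.h>

/* Model of a mask3-style FMA: selected lanes get v1*v2 + v3,
 * unselected lanes keep v3 (assumed semantics). */
static void mask3_fmadd8(const double v1[8], const double v2[8],
                         double v3[8], uint8_t k)
{
    for (int i = 0; i < 8; ++i)
        if ((k >> i) & 1)
            v3[i] = v1[i] * v2[i] + v3[i];
}

/* Same, but selected lanes get v1*v2 - v3. */
static void mask3_fmsub8(const double v1[8], const double v2[8],
                         double v3[8], uint8_t k)
{
    for (int i = 0; i < 8; ++i)
        if ((k >> i) & 1)
            v3[i] = v1[i] * v2[i] - v3[i];
}

void cmul_mask3(const double M[8], const double I[8], double res[8])
{
    double m0[8], m0_t[8], in0_r[8];

    for (int i = 0; i < 8; i += 2) {
        m0[i]   = m0[i + 1]   = M[i];      /* |a|a| */
        m0_t[i] = m0_t[i + 1] = M[i + 1];  /* |b|b| */
        in0_r[i]     = I[i + 1];           /* swapped pair |c|d| */
        in0_r[i + 1] = I[i];
    }

    for (int i = 0; i < 8; ++i)
        res[i] = m0_t[i] * in0_r[i];       /* tmp: b*d (even), b*c (odd) */

    mask3_fmadd8(m0, I, res, 0xAA);        /* odd lanes:  a*d + b*c */
    mask3_fmsub8(m0, I, res, 0x55);        /* even lanes: a*c - b*d */
}
```

Note that `res` holds only the product here; accumulating into O as the question asks would take one more add.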

Why would you use _mm512_fnmadd_pd(v1,v2,v3)? The output for this intrinsic is (~(v1*v2)) - v3

user1584773
  • Isn't that for _mm512_fnmsub_pd()? "Performs an element-by-element multiplication between float64 vector v1 and the float64 vector v2, then negates the result and subtracts float64 vector v3" – user1715122 Mar 12 '13 at 02:11
  • In addition to the above comment, is it possible to form m0 and m0_t from M? I was thinking about using _mm512_swizzle_pd(), but I don't think that will work. Any ideas? – user1715122 Mar 12 '13 at 03:42
  • Well, actually, one can just use the masked variant of _mm512_swizzle_pd() – user1715122 Mar 13 '13 at 17:37
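
A scalar sketch of how the masked-swizzle idea in the last comment could produce m0 and m0_t from M. This is only a model under assumptions: the swizzle is taken to swap neighbouring lanes within each pair, the masked form is taken to keep the old value in lanes whose mask bit is clear, and bit i of the mask is taken to control lane i (index 0 = lowest lane). `mask_swap_pairs8` and `split_M` are hypothetical names, not intrinsics.

```c
#include <stdint.h>

/* Model of a masked pair-swap: selected lanes take the neighbouring
 * lane of src, unselected lanes keep the value from old. */
static void mask_swap_pairs8(const double old[8], uint8_t k,
                             const double src[8], double out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = ((k >> i) & 1) ? src[i ^ 1] : old[i];
}

/* Build m0 = |a|a| and m0_t = |b|b| from M = |b|a| per pair. */
void split_M(const double M[8], double m0[8], double m0_t[8])
{
    mask_swap_pairs8(M, 0xAA, M, m0);    /* odd lanes pick up a, even keep a */
    mask_swap_pairs8(M, 0x55, M, m0_t);  /* even lanes pick up b, odd keep b */
}
```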