Suppose I have the following 4 vectors of doubles in Xeon Phi registers:
A-> |a8|a7|a6|a5|a4|a3|a2|a1|
B-> |b8|b7|b6|b5|b4|b3|b2|b1|
C-> |c8|c7|c6|c5|c4|c3|c2|c1|
D-> |d8|d7|d6|d5|d4|d3|d2|d1|
I want to permute them into the following:
A_new ->|d2|d1|c2|c1|b2|b1|a2|a1|
B_new ->|d4|d3|c4|c3|b4|b3|a4|a3|
C_new ->|d6|d5|c6|c5|b6|b5|a6|a5|
D_new ->|d8|d7|c8|c7|b8|b7|a8|a7|
The goal is to compute:
O = _mm512_add_pd(_mm512_add_pd(A_new,B_new),_mm512_add_pd(C_new,D_new));
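To pin down the layout, here is a plain scalar sketch of the result I am after. It is only a reference model, not the vectorized solution: the function names `permute4x8` and `reduce_reference` are made up for illustration, and I am assuming lane 0 of each array holds element 1 (so `src[0][0]` is a1, the rightmost entry in the diagrams above).

```c
#include <stddef.h>

/* Scalar reference for the permutation (hypothetical helper, not an intrinsic).
 * src[0..3] stand for A, B, C, D; dst[0..3] stand for A_new..D_new.
 * Assumption: index 0 is the rightmost element in the diagrams (a1, b1, ...). */
void permute4x8(const double src[4][8], double dst[4][8])
{
    for (size_t k = 0; k < 4; ++k) {        /* which output vector */
        for (size_t j = 0; j < 8; ++j) {    /* which output lane   */
            /* Lanes 0-1 come from A, 2-3 from B, 4-5 from C, 6-7 from D.
             * Output vector k takes the k-th pair of elements of each source. */
            dst[k][j] = src[j / 2][2 * k + (j % 2)];
        }
    }
}

/* Scalar equivalent of
 * _mm512_add_pd(_mm512_add_pd(A_new, B_new), _mm512_add_pd(C_new, D_new)). */
void reduce_reference(const double src[4][8], double out[8])
{
    double tmp[4][8];
    permute4x8(src, tmp);
    for (size_t j = 0; j < 8; ++j)
        out[j] = (tmp[0][j] + tmp[1][j]) + (tmp[2][j] + tmp[3][j]);
}
```

Written out, this means O would hold, from lane 0 upward: a1+a3+a5+a7, a2+a4+a6+a8, b1+b3+b5+b7, b2+b4+b6+b8, and so on for C and D.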
How can I achieve the above with the fewest instructions/cycles?