I'm writing a program that takes two 4x4 matrices and multiplies them using intrinsics. What I understand so far:
- The MMX/SSE instruction sets let you speed up computation. In particular, SSE works on vectors of 4-byte elements. __m128 represents a 16-byte vector (4 elements of 4 bytes each). Furthermore, __m128 data needs to be 16-byte aligned in order to work.
Where I get lost is here:
- The function _mm_mul_ps(__m128, __m128), as I have read, takes two 16-byte vectors of four 4-byte floats each. It multiplies the two vectors "one to one" and returns a __m128. But what exactly does that __m128 vector contain (the result of what)?
- The function _mm_hadd_ps(__m128, __m128) adds two 16-byte vectors (each of four 4-byte floats). It "adds horizontally" this way:
vectorA(a1, a2, a3, a4) + vectorB(b1, b2, b3, b4) = vectorResult(a1 + a2, a3 + a4, b1 + b2, b3 + b4)
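To make the two descriptions above concrete, here is a minimal sketch of what each intrinsic leaves in its four lanes. The helper names mul4 and hadd4 and the SSE3_FN macro are hypothetical, introduced only for this illustration; the macro is an assumption that lets GCC/Clang compile _mm_hadd_ps (an SSE3 instruction) without extra compiler flags. Unaligned loads and stores are used so plain float arrays work.

```c
#include <immintrin.h>  /* SSE (_mm_mul_ps) and SSE3 (_mm_hadd_ps) intrinsics */

/* Assumption: enable SSE3 per function on GCC/Clang; no-op elsewhere. */
#if defined(__GNUC__)
#define SSE3_FN __attribute__((target("sse3")))
#else
#define SSE3_FN
#endif

/* Element-wise product: out[i] = a[i] * b[i] for i = 0..3. Nothing is summed. */
SSE3_FN static void mul4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));
}

/* Horizontal add: out = (a[0]+a[1], a[2]+a[3], b[0]+b[1], b[2]+b[3]). */
SSE3_FN static void hadd4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_hadd_ps(va, vb));
}
```

With a = (1, 2, 3, 4) and b = (5, 6, 7, 8), mul4 should give (5, 12, 21, 32) and hadd4 should give (3, 7, 11, 15). So _mm_mul_ps alone yields the four products of a dot product, not their sum.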
What I'm trying to do:
// Stores the result of multiplying one row of A by one column of B
__declspec(align(16)) __m128 aux;
// Horizontal add
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
        aux = _mm_mul_ps(vectorA[i], vectorB[j]);
        // Add results
        aux = _mm_hadd_ps(aux, aux);
        aux = _mm_hadd_ps(aux, aux);
    }
}
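For reference, the loop above can be sketched as a complete function that also stores each result. This is only one possible reading of the intent, under stated assumptions: A is a flat row-major array of 16 floats, and Bt holds the columns of B laid out as rows (so Bt + 4*j points at column j of B). The names matmul4x4, Bt, and SSE3_FN are hypothetical and not taken from the code above.

```c
#include <immintrin.h>  /* SSE and SSE3 intrinsics */

/* Assumption: enable SSE3 per function on GCC/Clang; no-op elsewhere. */
#if defined(__GNUC__)
#define SSE3_FN __attribute__((target("sse3")))
#else
#define SSE3_FN
#endif

/* C = A * B for 4x4 matrices stored as 16 floats each.
   A is row-major; Bt stores B's columns as rows (B transposed). */
SSE3_FN static void matmul4x4(const float *A, const float *Bt, float *C) {
    for (int i = 0; i < 4; i++) {
        __m128 row = _mm_loadu_ps(A + 4 * i);       /* row i of A */
        for (int j = 0; j < 4; j++) {
            __m128 col = _mm_loadu_ps(Bt + 4 * j);  /* column j of B */
            __m128 aux = _mm_mul_ps(row, col);      /* (p0, p1, p2, p3) */
            aux = _mm_hadd_ps(aux, aux);            /* (p0+p1, p2+p3, p0+p1, p2+p3) */
            aux = _mm_hadd_ps(aux, aux);            /* every lane = p0+p1+p2+p3 */
            C[4 * i + j] = _mm_cvtss_f32(aux);      /* lane 0 is the dot product */
        }
    }
}
```

After the second _mm_hadd_ps, all four lanes hold the same dot product, so three of them are redundant. The sketch keeps the double horizontal add only because it mirrors the loop above.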
I can't see how these functions work (I don't have a "mental image").