How does _mm_mul_ps() add two __m128?

Question

I'm writing a program that takes two 4x4 matrices and multiplies them using intrinsics. What I understand so far:

The MMX/SSE instruction sets let you speed up computation by operating on several values at once. In particular, SSE works on vectors of 4-byte elements.
__m128 represents a 16-byte vector (4 elements of 4 bytes each). Furthermore, __m128 data must be 16-byte aligned in memory in order to work.
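
On the alignment point, a minimal sketch assuming GCC, Clang or MSVC on x86. The names `is_aligned16` and `load4` are hypothetical helpers made up for illustration, not part of any library. A `__m128` variable is aligned by the compiler automatically; the 16-byte requirement matters when loading from a plain `float` array with `_mm_load_ps`, whereas `_mm_loadu_ps` accepts any address.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_loadu_ps */

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* Hypothetical helper: loads 4 consecutive floats starting at p. */
static __m128 load4(const float *p) {
    /* _mm_load_ps requires a 16-byte-aligned address;
       _mm_loadu_ps works for any address. */
    return is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}
```

In practice one would usually either guarantee alignment of the arrays and call `_mm_load_ps` directly, or always call `_mm_loadu_ps`; the branch here only makes the distinction visible.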

Where I get lost is here:

The function _mm_mul_ps(__m128, __m128), as far as I have read, takes two 16-byte vectors of four 4-byte floats each. It multiplies the two vectors element by element and returns a __m128. But what exactly does that __m128 contain (the result of what)?
The function _mm_hadd_ps(__m128, __m128) adds two 16-byte vectors (each holding four 4-byte floats). It "adds horizontally" like this:
vectorA(a1, a2, a3, a4) + vectorB(b1, b2, b3, b4) = vectorResult(a1 + a2, a3 + a4, b1 + b2, b3 + b4)
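
To make the lane layout of both intrinsics concrete, here is a small sketch. The helper names `mul4` and `hadd4` and the `USE_SSE3` macro are assumptions made for illustration; the macro only exists so that GCC and Clang accept the SSE3 intrinsic without the `-msse3` flag. `_mm_mul_ps` returns the four element-wise products, and `_mm_hadd_ps` adds adjacent pairs within each input.

```c
#include <xmmintrin.h>  /* SSE:  _mm_mul_ps, _mm_loadu_ps, _mm_storeu_ps */
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Assumption: on GCC/Clang, enable SSE3 per function instead of via -msse3. */
#if defined(__GNUC__) || defined(__clang__)
#define USE_SSE3 __attribute__((target("sse3")))
#else
#define USE_SSE3
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] for i = 0..3 ("vertical"). */
static void mul4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));
}

/* Hypothetical helper: out = (a0+a1, a2+a3, b0+b1, b2+b3) ("horizontal"). */
USE_SSE3 static void hadd4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_hadd_ps(va, vb));
}
```

With a = (1, 2, 3, 4) and b = (10, 20, 30, 40), `mul4` writes (10, 40, 90, 160) and `hadd4` writes (3, 7, 30, 70).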

What I'm trying to do:

    // Holds the product of one row of A and one column of B.
    // (__m128 is already 16-byte aligned, so no align attribute is needed.)
    __m128 aux;

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            // Element-wise multiply: (a1*b1, a2*b2, a3*b3, a4*b4)
            aux = _mm_mul_ps(vectorA[i], vectorB[j]);
            // Two horizontal adds leave the sum of the four products in every lane
            aux = _mm_hadd_ps(aux, aux);
            aux = _mm_hadd_ps(aux, aux);
        }
    }

I can't see how the functions work (I don't have a "mental image").

Draw it out on paper - that always helps. SIMD operations are typically "vertical" (element-wise), but sometimes (e.g. at the end of a reduction) you need a "horizontal" operation (across the SIMD vector). — Paul R, Nov 17 '16 at 17:47
But what is the difference between the multiplying process with __m128 and the "traditional" matrix multiplication? I mean this one [link](http://imgur.com/xfztSis) — chick3n0x07CC, Nov 18 '16 at 08:32
It looks like your B matrix has already been transposed, so each iteration is multiplying a row of A by a column of B (actually a row, since it's now B'). Then the 4 product terms are summed horizontally, which requires two `_mm_hadd_ps` operations for a full horizontal add. So you get `sum(a1 * b1, a2 * b2, a3 * b3, a4 * b4)` in your inner loop. (Note that for a "traditional" implementation this would have been an additional loop to multiply and sum the 4 innermost terms). — Paul R, Nov 18 '16 at 09:02
@gallina0x07CC: Read the manuals: http://www.felixcloutier.com/x86/MULPS.html or https://software.intel.com/sites/landingpage/IntrinsicsGuide/ — Peter Cordes, Nov 18 '16 at 09:27
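
The approach described in the comments (multiply a row of A by a row of the transposed B, then sum the four products with two horizontal adds) could be sketched as follows. This is an illustration under stated assumptions, not the asker's actual program: matrices are taken to be flat row-major arrays of 16 floats, `Bt` is assumed to already hold the transpose of B, and the names `dot4`, `matmul4x4_bt` and `USE_SSE3` are hypothetical.

```c
#include <xmmintrin.h>  /* SSE:  _mm_mul_ps, _mm_loadu_ps, _mm_cvtss_f32 */
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Assumption: on GCC/Clang, enable SSE3 per function instead of via -msse3. */
#if defined(__GNUC__) || defined(__clang__)
#define USE_SSE3 __attribute__((target("sse3")))
#else
#define USE_SSE3
#endif

/* Hypothetical helper: returns x0*y0 + x1*y1 + x2*y2 + x3*y3. */
USE_SSE3 static float dot4(__m128 x, __m128 y) {
    __m128 p = _mm_mul_ps(x, y);  /* (x0*y0, x1*y1, x2*y2, x3*y3)   */
    p = _mm_hadd_ps(p, p);        /* (p0+p1, p2+p3, p0+p1, p2+p3)   */
    p = _mm_hadd_ps(p, p);        /* full sum replicated in all lanes */
    return _mm_cvtss_f32(p);      /* extract the lowest lane        */
}

/* Hypothetical helper: C = A * B, where Bt is B transposed.
   All three matrices are row-major arrays of 16 floats. */
USE_SSE3 static void matmul4x4_bt(const float *A, const float *Bt, float *C) {
    for (int i = 0; i < 4; i++) {
        __m128 row = _mm_loadu_ps(A + 4 * i);
        for (int j = 0; j < 4; j++) {
            /* Row j of Bt is column j of B. */
            C[4 * i + j] = dot4(row, _mm_loadu_ps(Bt + 4 * j));
        }
    }
}
```

The only difference from the "traditional" triple loop is that the innermost multiply-and-sum loop over four terms is replaced by one `_mm_mul_ps` and two `_mm_hadd_ps` calls.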
