I have to implement matrix-vector multiplication using SSE/SSE2. The vector and matrix are large. The matrix is double, the vector is float.

The point is that all calculations have to be done on floats: when I get data from the matrix I promote it to float, do the calculations, and end up with a float vector (later, after some additional calculations on floats, I have to add some float values (a float matrix) to double values (a double matrix)).

My question is how I can do this using SSE/SSE2 - the problem is with the doubles - I have a double* pointer and I have to somehow convert 4 doubles into 4 floats to fit them into a __m128... Are there any instructions to do that?

user606521
  • Surely you should be doing this the other way around, i.e. promote your float vector values to double and do the calculations at double precision, otherwise why bother with double precision in your matrix in the first place? – Paul R Feb 28 '11 at 08:02
  • Okay, nobody understands it - I am working on neural networks, and the double matrix holds the neural network weights. When I train my neural network I calculate the change of the weights, which is really small - if the weights were float then adding a very small float value to another float would cause a loss of precision, whereas when I add a very small float value to a double I don't lose precision. Of course I could do everything on doubles but it's not necessary - and operations on floats are faster. – user606521 Feb 28 '11 at 09:11
  • Okay my bad - you were right - doing it on doubles will be faster and better :) – user606521 Feb 28 '11 at 15:13

2 Answers


Changing from double to float is reducing the level of precision, not increasing it. For more accuracy, you should do the computations on doubles (promoting the vector to that type), then possibly cast the result back down to float afterwards. The instructions you need for conversion are cvtps2pd (float to double) and/or cvtpd2ps (double to float). Those only convert two values at a time (since only two doubles fit into an SSE register), so you will need to do your conversion in two parts.
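For the promotion direction this answer recommends, a minimal sketch could look like the following (the helper name is mine, and unaligned loads are assumed); it widens two floats at a time with `_mm_cvtps_pd` (CVTPS2PD) and accumulates the products in double precision:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: promote 4 floats to two double vectors and accumulate
   their products with 4 consecutive doubles from the matrix row. */
static inline __m128d mul_add_4(const double *mat, const float *vec, __m128d acc)
{
    __m128  f    = _mm_loadu_ps(vec);                  /* f3 f2 f1 f0 */
    __m128d d_lo = _mm_cvtps_pd(f);                    /* (double)f1, (double)f0 */
    __m128d d_hi = _mm_cvtps_pd(_mm_movehl_ps(f, f));  /* (double)f3, (double)f2 */

    acc = _mm_add_pd(acc, _mm_mul_pd(_mm_loadu_pd(mat),     d_lo));
    acc = _mm_add_pd(acc, _mm_mul_pd(_mm_loadu_pd(mat + 2), d_hi));
    return acc;  /* two partial sums; reduce to one double at the end of the row */
}
```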

Jeremiah Willcock
  • 1. The point is that I don't need double precision during calculations - I need it only when storing the result. – user606521 Feb 28 '11 at 08:39
  • 2. How can I use this conversion to get 4 floats in one __m128? As I understand it, this instruction fills bits [0..63] with 2 float values and sets bits [64..127] to 0. After two such operations I will have 2 __m128 with 2 floats each. The only thing that comes to my mind is to swap the high and low qwords in one of the __m128 and then logically OR them to get 4 floats in one __m128... Do you know a better solution? – user606521 Feb 28 '11 at 08:44
  • Look at `movhlps` and `movlhps` for the data movement. – Jeremiah Willcock Feb 28 '11 at 16:29
  • What is the point of using double precision to store the results of single precision calculations? – Jeremiah Willcock Feb 28 '11 at 16:29

You need to call __m128 _mm_cvtpd_ps (__m128d a) (CVTPD2PS) twice to get two single precision float vectors, each containing two of your original double precision values, then merge these two float vectors into a single vector, using e.g. __m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int imm8) (SHUFPS).
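A minimal sketch of that convert-twice-then-merge sequence (the helper name is mine, and unaligned loads are assumed):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: pack 4 consecutive doubles into one __m128 of floats. */
static inline __m128 load4_doubles_as_floats(const double *p)
{
    __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(p));      /* 0, 0, (float)p[1], (float)p[0] */
    __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(p + 2));  /* 0, 0, (float)p[3], (float)p[2] */
    /* take lanes 0,1 from lo and lanes 0,1 from hi -> p[3] p[2] p[1] p[0] */
    return _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(1, 0, 1, 0));
}
```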

Paul R
  • `_mm_shuffle_ps` is inefficient in this case, I suggest using `_mm_movelh_ps` or `_mm_movehl_ps` for combining the two 'high' and 'low' vectors (see the sketch after these comments). – LiraNuna Mar 07 '11 at 21:39
  • @LiraNuna On Intel chips they are the same speed, 1 cycle latency / 1 cycle throughput. But on older AMD like Jaguar, shufps is 4 times faster than movlhps or movhlps. – Soonts Nov 12 '18 at 16:19
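The `_mm_movelh_ps` variant suggested in the comments is the same idea with only the merge step swapped (again just a sketch, with the same assumed helper shape as above):

```c
#include <emmintrin.h>

/* Hypothetical alternative merge using MOVLHPS instead of SHUFPS. */
static inline __m128 load4_doubles_as_floats_movelh(const double *p)
{
    __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(p));
    __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(p + 2));
    return _mm_movelh_ps(lo, hi);  /* low 2 floats of lo, then low 2 floats of hi */
}
```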