13

I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions.

  1. I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement?

  2. Do I need the Intel compiler to do it?

  3. Can you point out some references?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Enzo
  • 964
  • 1
  • 9
  • 20
  • Posting as a comment and not an answer as this is only speculation but don't some compilers optimize various matrix multiplication operations? I seem to remember an old university project of nested for-loop multiplication vs. threaded multiplication having an immensely faster run time because of optimization... – Grambot Jul 07 '11 at 22:06
  • If you have written any code, please post. How many times is "an awful"? How long does it take today, and what would you like to reach? – Fredrik Pihl Jul 07 '11 at 22:08
  • 1
    http://msdn.microsoft.com/en-us/library/y0dh78ez%28v=vs.80%29.aspx – OrangeDog Jul 07 '11 at 22:09
  • Are you already using some BLAS? – leftaroundabout Jul 07 '11 at 22:10
  • What is the type of the data in the matrix and vector ? Single precision ? Double precision ? Integer ? – Paul R Jul 07 '11 at 22:11
  • 2
    Also does this need to work on pretty much any x86 CPU or can we assume e.g. Intel and SSSE3 or later ? – Paul R Jul 07 '11 at 22:16
  • SSE will do nothing if you're constantly cache missing, which is inevitably the case with 5x5 matrices (one of the matrices has poor ordering, and 5 is a bad number: you'll cache miss every loop iteration). Use Intel IPP and don't even remotely try to do this yourself. – Alexandre C. Jul 07 '11 at 22:30
  • 2
    @Alexandre C. : matrices? Plural? Question says "always the same". Besides, `5*5*sizeof(double)` is far, far less than the size of even an L1 cache. Why would you get cache misses? – MSalters Jul 08 '11 at 10:01

8 Answers

9

The Eigen C++ template library for vectors, matrices, ... has both

  • optimised code for small fixed size matrices (as well as dynamically sized ones)

  • optimised code that uses SSE optimisations

so you should give it a try.
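For reference, the fixed-size usage is only a few lines (a sketch against the stock Eigen 3 API; the matrix would be filled in once at startup):

```cpp
#include <Eigen/Dense>

Eigen::Matrix<float, 5, 5> M;           // the fixed matrix, set once
Eigen::Matrix<float, 5, 1> x, y;        // per-iteration input / output

void step() { y.noalias() = M * x; }    // .noalias() skips the temporary
```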

Dirk Eddelbuettel
  • 360,940
  • 56
  • 644
  • 725
  • 3
    Note the Eigen docs claim that it doesn't perform well with fixed vectors with a size that's not a multiple of 16 bytes, so it may not automatically vectorize for this problem. Whether that's still the case with Eigen3 I can't say. – John L Jul 07 '11 at 22:29
  • Thanks for that note -- I was unaware of that restriction. But then I use dynamically sized vectors and matrices more anyway. – Dirk Eddelbuettel Jul 07 '11 at 22:31
  • @John L Thanks for your comment. Yes, I found the same in the documentation. Do you think it's because of an underlying limit of SSE optimisation or of the library? Thanks! – Enzo Jul 08 '12 at 12:27
  • @Enzo: It's about SSE. SSE performs X flops, usually 4, in one instruction. If your size isn't a multiple of 4 (× 4-byte floats = 16 bytes), then you can't express the operation in just SSE instructions. – Puppy Jul 08 '11 at 14:25
  • @Enzo - DeadMG is exactly right. If Eigen doesn't work out, try to roll your own. The MSDN docs on SSE intrinsics are pretty good, it's mostly the same for other compilers. – John L Jul 10 '11 at 01:28
  • @DeadMG, Enzo etc: You can use SSE instructions for vector/matrix sizes that are not multiples of 128 bits, as long as the size is >= 128 bits. As DeadMG alluded to, you need to split the loop into 3 parts - an 'initial' part that uses pure c++ iterations to align on a 128 bit boundary, a 'main' part that uses aligned SSE instructions to do the bulk of the work, and a 'final' part that uses pure c++ iterations to do any remainder. – Darren Engwirda Jul 10 '11 at 01:48
  • @everyone, what about zero-padding? Is it possible to do with vectors what can be done with polynomials? – Jacek Krysztofik Nov 14 '12 at 22:16
  • The fact that the 5x5 matrix is not a multiple of four is a red herring. You instead form a 5x4 matrix from the vectors and multiply the fixed 5x5 matrix on that. In principle you can get a 4x speedup with SSE. The same idea can be used for any size vector (e.g. 3D vectors). See my answer. –  Mar 17 '13 at 20:26
5

In principle the speedup could be 4 times with SSE (8 times with AVX). Let me explain.

Let's call your fixed 5x5 matrix M, and define the components of a 5D vector as (x,y,z,w,t). Now form a 5x4 matrix U from the first four vectors.

U =
xxxx
yyyy
zzzz
wwww
tttt

Next, do the matrix product MU = V. The matrix V contains the product of M and the first four vectors. The only problem is that for SSE we need to read in the rows of U, but in memory U is stored as xyzwtxyzwtxyzwtxyzwt, so we have to transpose it to xxxxyyyyzzzzwwwwtttt. This can be done with shuffles/blends in SSE. Once we have this format the matrix product is very efficient.
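As a concrete illustration (my own sketch, not the poster's code; the function names are made up): the 4x4 corner of that transpose can use the stock _MM_TRANSPOSE4_PS macro from xmmintrin.h, with the fifth component gathered separately, and the product itself is then just broadcast-multiply-accumulate:

```cpp
#include <xmmintrin.h>

// AoS -> SoA: four xyzwt vectors -> U = {xxxx, yyyy, zzzz, wwww, tttt}
static void transpose_5x4(const float* v /* xyzwtxyzwt... */, __m128 U[5])
{
    __m128 r0 = _mm_loadu_ps(v +  0);             // x0 y0 z0 w0
    __m128 r1 = _mm_loadu_ps(v +  5);             // x1 y1 z1 w1
    __m128 r2 = _mm_loadu_ps(v + 10);             // x2 y2 z2 w2
    __m128 r3 = _mm_loadu_ps(v + 15);             // x3 y3 z3 w3
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);            // 4x4 shuffle-based transpose
    U[0] = r0; U[1] = r1; U[2] = r2; U[3] = r3;
    U[4] = _mm_set_ps(v[19], v[14], v[9], v[4]);  // gather t0 t1 t2 t3
}

// V = M*U: V[i] holds output component i of all four vectors at once
static void mul5x5_block4(const float M[5][5], const __m128 U[5], __m128 V[5])
{
    for (int i = 0; i < 5; ++i) {
        __m128 acc = _mm_mul_ps(_mm_set1_ps(M[i][0]), U[0]);
        for (int j = 1; j < 5; ++j)               // 25 mul + 20 add for four
            acc = _mm_add_ps(acc,                 // vectors, vs ~100 scalar flops
                             _mm_mul_ps(_mm_set1_ps(M[i][j]), U[j]));
        V[i] = acc;
    }
}
```

Since M never changes, the _mm_set1_ps(M[i][j]) broadcasts can be hoisted out of the loop over blocks entirely.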

Instead of taking O(5x5x4) operations with scalar code it only takes O(5x5) operations, i.e. a 4x speedup. With AVX the matrix U will be 5x8, so instead of taking O(5x5x8) operations it only takes O(5x5), i.e. an 8x speedup.

The matrix V, however, will be in xxxxyyyyzzzzwwwwtttt format so depending on the application it might have to be transposed to xyzwtxyzwtxyzwtxyzwt format.

Repeat this for the next four vectors (8 for AVX) and so forth until done.

If you have control over the vectors, for example if your application generates the vectors on the fly, then you can generate them in xxxxyyyyzzzzwwwwtttt format and avoid the transpose of the array. In that case you should get a 4x speedup with SSE and an 8x with AVX. If you combine this with threading, e.g. OpenMP, your speedup should be close to 16x (assuming four physical cores) with SSE. I think that's the best you can do with SSE.
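A sketch of that last step, layering OpenMP over the block kernel sketched earlier (mul5x5_block4, the blocking by 4 and the pre-transposed layout are all assumptions carried over from that sketch):

```cpp
#include <xmmintrin.h>

void mul5x5_block4(const float M[5][5], const __m128 U[5], __m128 V[5]);

// nblocks = number_of_vectors / 4; U and V hold 5 __m128 rows per block
void mul_all(const float M[5][5], const __m128* U, __m128* V, int nblocks)
{
    #pragma omp parallel for        // compile with -fopenmp (GCC) or /openmp (MSVC)
    for (int b = 0; b < nblocks; ++b)
        mul5x5_block4(M, U + 5 * b, V + 5 * b);
}
```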

Edit: Due to instruction level parallelism (ILP) you can get another factor of 2 in speedup, so the speedup for SSE could be 32x with four cores (64x with AVX), and again another factor of 2 with Haswell due to FMA3.

  • ILP and FMA will also speed up scalar code; they're not unique to SIMD. At that point you're just calculating theoretical max FLOPS/clock, not *speedup* relative to scalar. – Peter Cordes Jan 04 '20 at 10:44
4

I would suggest using Intel IPP and abstracting yourself from dependence on these techniques

Ulterior
  • 2,786
  • 3
  • 30
  • 58
  • 1
    They probably know a lot about tricky techniques to take advantage of Intel processor caches. You should compare to Eigen though, but imho IPP is a better product for this. – Alexandre C. Jul 07 '11 at 22:32
4

If you're using GCC, note that the -O3 option enables auto-vectorization, which will automatically generate SSE or AVX instructions in many cases. In general, if you write the operation as a simple for-loop, GCC can often vectorize it. See http://gcc.gnu.org/projects/tree-ssa/vectorization.html for more information.
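For example, a plain loop of this shape is what the vectorizer targets (just a sketch; note that GCC typically needs -ffast-math, or at least -fassociative-math, before it will vectorize the float reduction, and whether a 5-iteration loop is worth vectorizing is up to its cost model):

```cpp
// matvec5.cpp -- the naive loop, left for -O3 to auto-vectorize
void matvec5(const float M[5][5], const float x[5], float y[5])
{
    for (int i = 0; i < 5; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < 5; ++j)
            acc += M[i][j] * x[j];  // float reduction: needs -ffast-math to vectorize
        y[i] = acc;
    }
}
// g++ -O3 -ffast-math -ftree-vectorizer-verbose=2 -c matvec5.cpp
```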

Jeremy Salwen
  • 8,061
  • 5
  • 50
  • 73
  • 1
    any decent compiler can do autovectorization, but only for some simple, known patterns. For any other case you'll need to write vectorized code yourself, or use a library written with that in mind – phuclv Jan 03 '20 at 00:24
2

This should be easy, especially when you're on Core 2 or later: you need five _mm_dp_ps, one _mm_mul_ps, two _mm_add_ps, one ordinary multiplication, plus some shuffles, loads and stores (and if the matrix is fixed, you can keep most of it in SSE registers, if you don't need them for anything else).
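For illustration, here is one way to realise that kind of instruction mix (a sketch of mine, requiring SSE4.1 for _mm_dp_ps; splitting the matrix into a 5x4 part plus a fifth column is my assumption about layout, and the names are made up):

```cpp
#include <smmintrin.h>   // SSE4.1: _mm_dp_ps

void mul5x5_dpps(const __m128 Mrow[5],  // columns 0..3 of each matrix row
                 const float  Mc4[5],   // column 4 of each row
                 __m128 v4, float v5,   // input vector: (x,y,z,w) and t
                 float y[5])
{
    for (int i = 0; i < 5; ++i) {
        // imm8 = 0xF1: multiply all four lanes, place the sum in lane 0
        y[i] = _mm_cvtss_f32(_mm_dp_ps(Mrow[i], v4, 0xF1)) + Mc4[i] * v5;
    }
}
```

Since the matrix is fixed, Mrow[0..4] can simply stay resident in five XMM registers across iterations.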

As for memory bandwidth: we're talking about roughly 4.8 megabytes of vectors per second (240000 × 5 floats × 4 bytes), while memory bandwidths are in single-digit gigabytes per second.

phuclv
  • 37,963
  • 15
  • 156
  • 475
maniek
  • 7,087
  • 2
  • 20
  • 43
1

What is known about the vector? Since the matrix is fixed, and if there is a limited set of values that the vector can take, then I'd suggest pre-computing the results and accessing them using a table look-up.

The classic optimization technique to trade memory for cycles...
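A minimal sketch of that idea, assuming the inputs really can be quantized to a small integer grid (everything here -- the grid, the key packing, the function names -- is hypothetical):

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

using Vec5 = std::array<float, 5>;

Vec5 slow_matvec(const Vec5& v);            // compute M*v from scratch (fallback)

std::unordered_map<uint64_t, Vec5> table;   // key -> precomputed M*v

// assumption: each component lies on a known 0..255 grid
uint64_t key_of(const Vec5& v)
{
    uint64_t k = 0;
    for (float c : v)
        k = (k << 8) | static_cast<uint8_t>(c);
    return k;
}

Vec5 lookup_or_compute(const Vec5& v)
{
    auto it = table.find(key_of(v));
    return it != table.end() ? it->second : slow_matvec(v);
}
```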

Fredrik Pihl
  • 44,604
  • 7
  • 83
  • 130
  • 1
    It seems optimistic to me that there should be a reasonably limited amount of values the vector can take, but it might not be a problem to quantize the vectors accordingly. To do it better, one could then interpolate between those quantized vectors and get much better results – but this would likely be slower than a properly optimized straightforward matrix multiplication. – leftaroundabout Jul 08 '11 at 01:09
  • @leftaroundabout - perhaps, perhaps not. It's up to the OP to gather statistics on the input and then decide if this can be used or not. In a previous project, I found out that more than 95% of the calls to a highly complex calc-function had a **very** limited input range; precalculating those sped things up by an order of magnitude or more. If not found in the table-lookup, then we'd resort to calculating from scratch. – Fredrik Pihl Jul 08 '11 at 08:08
  • Thanks for your reply! Unfortunately I can't do that. It is a real-time software, and the number of possible vectors is infinite. – Enzo Jul 08 '11 at 12:28
0

I would recommend having a look at an optimised BLAS library, such as the Intel MKL or the AMD ACML. Based on your description I would assume that you'd be after the SGEMV level 2 matrix-vector routine, to do y = A*x style operations.

If you really want to implement something yourself, using the (available) SSE..SSE4 and AVX instruction sets can offer significant performance improvements in some cases, although this is exactly what a good BLAS library will be doing. You also need to think a lot about cache-friendly data access patterns.

I don't know if this is applicable in your case, but can you operate on "chunks" of vectors at a time? So rather than repeatedly doing a y = A*x style operation, can you operate on blocks of [y1 y2 ... yn] = A * [x1 x2 ... xn]? If so, this means that you could use an optimised matrix-matrix routine, such as SGEMM. Due to the data access patterns this may be significantly more efficient than repeated calls to SGEMV. If it were me, I would try to go down this path, as sketched below...
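A sketch of that blocked call, using the standard CBLAS interface that MKL, ACML and friends expose (column-major is convenient here, since each 5-float vector is then just a contiguous column; n is the number of vectors per chunk):

```cpp
#include <cblas.h>

// Y (5 x n) = A (5 x 5) * X (5 x n), vectors stored as consecutive columns
void mul_chunk(const float* A, const float* X, float* Y, int n)
{
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                5, n, 5,        // M, N, K
                1.0f, A, 5,     // alpha, A, lda
                X, 5,           // X, ldb
                0.0f, Y, 5);    // beta, Y, ldc
}
```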

Hope this helps.

Darren Engwirda
  • 6,915
  • 4
  • 26
  • 42
  • I'd expect that a fixed 5x5 matrix could be kept entirely in registers, so cache access wouldn't have a big effect (provided the vectors have a sane layout). Because of that, this seems like a pretty good problem for an introduction to SSE programming. Although that would still be my last resort, after trying compiler options and libraries. – John L Jul 10 '11 at 01:33
  • @John L: Eh?? You still need to load the registers before you can use them, and you definitely want to order your instructions so that you're doing this contiguously (maybe even with appropriate data prefetch as well). This is what I was getting at with "cache friendly access patterns"... :) – Darren Engwirda Jul 10 '11 at 01:57
  • 1
    the matrix doesn't change, so you only need to load it once before the iterations begin. The OP's problem is likely similar to `y[0] = i[0]; y[n] = m*(y[n-1])`. Only the new vector needs to be loaded at each iteration, which most programmers would do contiguously, and even if not the compiler is much more likely to spot it and reorder. – John L Jul 10 '11 at 09:36
0

If you know the vectors in advance (e.g., doing all 240k at once), you'd get a better speedup by parallelising the loop than by going to SSE. If you've already taken that step, or you don't know them all at once, SSE could be a big benefit.

If the memory is contiguous, then don't worry too much about the memory operations. If you've got a linked list or something then you're in trouble, but otherwise the memory system should be able to keep up without too much of a problem.

5x5 is a funny size, but you could do at least 4 flops in one SSE instruction and try to cut your arithmetic overheads. You don't need the Intel compiler, but it might be better, I've heard legends about how it's much better with arithmetic code. Visual Studio has intrinsics for dealing with SSE2, and I think up to SSE4 depending on what you need. Of course, you'd have to roll it yourself. Grabbing a library might be the smart move here.

Puppy
  • 144,682
  • 38
  • 256
  • 465