4

I would like to implement a parallel matrix-vector multiplication for a fixed-size matrix (~3500x3500 floats), optimized for my CPUs and cache layout (AMD Zen 2/4), that is repeatedly executed for changing input vectors (set-up time is not critical, sustained performance is). The programming language is C++.

Can anyone point me at good (perhaps optimal) strategies for partitioning the matrix and the threads with respect to cache utilization and synchronization (reduction +=) overhead? For example, what block size is best, and how best to traverse the multiplication with several threads? I'd then try to apply the strategy to my particular CPUs.

I'm free to duplicate matrix data for cache efficiency across multiple CCXs, and the matrix doesn't need to be contiguous in RAM either. I can choose any format and order that promises best efficiency.

Alternatively, I'd also appreciate it if anybody knows of such a library or is able to share code. No need to reinvent things :)

Thanks.

northwindow
  • 178
  • 8
  • Why not use BLAS libraries? They are made exactly for this, and have been highly optimized for decades for many specific platforms. Reinventing the wheel does not seem like a good idea. You can try [BLIS](https://github.com/flame/blis) for example. Actually, AMD recommends using it on their own CPUs. – Jérôme Richard Feb 25 '23 at 01:15
  • 2
    I have tried several BLAS libraries. BLIS is not multi-threaded for level-2 operations. Slicing the matrix myself into several smaller BLIS multiplications doesn't perform well. OpenBLAS is multi-threaded, but doesn't perform (scale) well; it has no knowledge of the cache layout. Finally, I tried MKL, which performs a lot better than OpenBLAS but still has several issues - apart from the risk that Intel doesn't support AMD and that it could at any time become impossible to get MKL performing well on AMD. – northwindow Feb 25 '23 at 01:22
  • Even MKL's performance is probably not optimal, because I doubt they optimize for the Zen architecture. In general, while BLAS has been around for a long time, I suspect that most of the famous and accessible implementations are not made for highly parallel MVMs on modern many-core CPUs. Also, BLAS needs to set up its internals for each sgemv() call. The BLAS API is tailored around monolithically stored matrices, and implementations do not reorder the data even when that would be beneficial; there's no such thing as a plan as in FFTW. BLAS is not optimized for repeated multiplications of the same matrix with a new vector. – northwindow Feb 25 '23 at 01:31
  • 1
    Finally, a compile-time-sized MVM leaves more room for optimization than any dynamic algorithm can offer. – northwindow Feb 25 '23 at 01:31
  • "Actually, AMD recommend using it on their own CPU", in my experience, everything that AMD recommends or optimized (FFTW, AOCC, etc) has no benefit over vanilla versions at best, or is even slower. I haven't found anything they recommend to improve the performance at the array sizes I work with. – northwindow Feb 25 '23 at 01:35

2 Answers

4

First try Eigen. Depending on the compiler, you might need to manually define macros for proper SIMD: for Zen 2-3 you'd want EIGEN_VECTORIZE_AVX, EIGEN_VECTORIZE_FMA and EIGEN_VECTORIZE_AVX2, for Zen 4 also EIGEN_VECTORIZE_AVX512.
Also, be sure to enable OpenMP in the project settings.
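
For reference, a minimal sketch of such an Eigen baseline (the matrix size, benchmark loop, and compiler flags below are my assumptions, not from this thread):

// Eigen baseline (sketch). Compile with e.g. -O3 -march=znver2 -fopenmp,
// or define the EIGEN_VECTORIZE_* macros manually as described above.
#include <Eigen/Dense>
#include <iostream>

int main()
{
    const int n = 3456;                        // example size; the question mentions ~3500
    Eigen::MatrixXf A = Eigen::MatrixXf::Random( n, n );
    Eigen::VectorXf x = Eigen::VectorXf::Random( n );
    Eigen::VectorXf y( n );

    for( int i = 0; i < 1000; i++ )            // repeated products with changing input vectors
    {
        y.noalias() = A * x;                   // single-precision matrix * vector product
        x = y / y.norm();                      // change the input for the next iteration
    }
    std::cout << y[ 0 ] << "\n";               // keep the optimizer from discarding the work
    return 0;
}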

If you want to try improving performance further, your #1 goal is saving memory bandwidth. A matrix-vector multiplication is practically guaranteed to bottleneck on memory, not compute.

Reshape the matrix into panels, like this:

[Figure: a matrix reshaped into two panels of height 4; the numbers are the 0-based positions of the elements in memory, running down each column of a panel, then across its columns, with the next panel following after the first.]

The numbers in the figure are the 0-based indices of the elements in memory.
Only instead of 4, use a panel height of 32 for AVX, or 64 for AVX512.
Also, don't forget to align the data by at least the vector size, ideally by 64 bytes (a cache line).

Note that the last panel of the matrix probably needs its columns zero-padded. And ideally, the output vectors also need a few extra elements to make their length a multiple of the panel height; otherwise you need special code to handle the last panel of the matrix.
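
For concreteness, a reshaping routine along these lines could look as follows; this is my own sketch rather than code from this answer, with the panel height as a parameter (32 for the AVX code below):

#include <immintrin.h>
#include <cstring>
#include <cstddef>

// Reorder a row-major height*width matrix into panels of `panelHeight` rows.
// Within a panel the elements are stored column by column, as in the figure;
// the panels follow each other in memory, and the last one is zero-padded.
// This is a sketch of that layout, not code from the answer.
float* reshapeToPanels( const float* rowMajor, size_t height, size_t width, size_t panelHeight )
{
    const size_t panelCount = ( height + panelHeight - 1 ) / panelHeight;
    const size_t bytes = panelCount * panelHeight * width * sizeof( float );
    float* const dst = (float*)_mm_malloc( bytes, 64 );   // 64-byte alignment, see above
    std::memset( dst, 0, bytes );                         // zero-pads the last panel

    for( size_t r = 0; r < height; r++ )
        for( size_t c = 0; c < width; c++ )
        {
            const size_t panel = r / panelHeight;
            const size_t rowInPanel = r % panelHeight;
            dst[ ( panel * width + c ) * panelHeight + rowInPanel ] = rowMajor[ r * width + c ];
        }
    return dst;   // release with _mm_free()
}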

In the inner loop, do something like this (untested):

#include <immintrin.h>

// Compute the product of a width*32 matrix panel and a vector of length `width`;
// the result is a vector of length 32
void multiplyInner_avx( const float* mat, const float* vec, size_t width, float* rdi )
{
    // Initialize the accumulators
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    // Compute these products
    const float* const vecEnd = vec + width;
    while( vec < vecEnd )
    {
        const __m256 v = _mm256_broadcast_ss( vec );
        vec++;

        acc0 = _mm256_fmadd_ps( v, _mm256_load_ps( mat ), acc0 );
        acc1 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 8 ), acc1 );
        acc2 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 16 ), acc2 );
        acc3 = _mm256_fmadd_ps( v, _mm256_load_ps( mat + 24 ), acc3 );
        mat += 32;
    }

    // Store the products
    _mm256_store_ps( rdi, acc0 );
    _mm256_store_ps( rdi + 8, acc1 );
    _mm256_store_ps( rdi + 16, acc2 );
    _mm256_store_ps( rdi + 24, acc3 );
}

For Zen 4 you're going to need another version of the above, to leverage AVX512 vectors.
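
An untested sketch of such an AVX512 variant might look like the following, with a panel height of 64 (my adaptation, not the author's code):

#include <immintrin.h>

// Untested AVX512 sketch of the same idea (my adaptation): width*64 panel times
// a vector of length `width`, producing 64 output elements.
void multiplyInner_avx512( const float* mat, const float* vec, size_t width, float* rdi )
{
    __m512 acc0 = _mm512_setzero_ps();
    __m512 acc1 = _mm512_setzero_ps();
    __m512 acc2 = _mm512_setzero_ps();
    __m512 acc3 = _mm512_setzero_ps();

    const float* const vecEnd = vec + width;
    while( vec < vecEnd )
    {
        const __m512 v = _mm512_set1_ps( *vec );
        vec++;

        acc0 = _mm512_fmadd_ps( v, _mm512_load_ps( mat ), acc0 );
        acc1 = _mm512_fmadd_ps( v, _mm512_load_ps( mat + 16 ), acc1 );
        acc2 = _mm512_fmadd_ps( v, _mm512_load_ps( mat + 32 ), acc2 );
        acc3 = _mm512_fmadd_ps( v, _mm512_load_ps( mat + 48 ), acc3 );
        mat += 64;
    }

    _mm512_store_ps( rdi, acc0 );
    _mm512_store_ps( rdi + 16, acc1 );
    _mm512_store_ps( rdi + 32, acc2 );
    _mm512_store_ps( rdi + 48, acc3 );
}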

In the outer loop, divide the matrix into batches of approximately equal size, so that the count of batches equals the count of hardware threads in your CPU. Dispatch each batch to a different CPU thread; an easy way to do that is OpenMP.

Ideally, make sure the process is stable, i.e. that when you call your multiplication function for different vectors, the same batches of the input matrix are dispatched to the same CPU cores.

Note there’s no reduction anywhere, and no need for that. Different CPU cores simply compute different slices of the output vector, by multiplying different slices of the input matrix by the complete input vector.
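
With the matrix reshaped as described, the outer loop can be a simple parallel loop over panels. The sketch below is my interpretation (the function name and the static schedule are assumptions), calling multiplyInner_avx from above with a panel height of 32:

#include <cstddef>

// result = matrix * vec for the reshaped matrix: `panelCount` panels of 32 rows,
// each panel of length `width`. schedule(static) keeps the panel-to-thread mapping
// identical across calls, which gives the stable dispatch mentioned above.
void multiply_mt( const float* reshaped, const float* vec,
                  size_t width, size_t panelCount, float* result )
{
#pragma omp parallel for schedule( static )
    for( int p = 0; p < (int)panelCount; p++ )
        multiplyInner_avx( reshaped + (size_t)p * width * 32,
                           vec, width, result + (size_t)p * 32 );
}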

The magic numbers are more or less arbitrary. Experiment to find the best ones. For instance, it's probably a good idea to increase the panel height by a factor of 2, making the multiplyInner_avx function produce 64 elements of the output, at the cost of 4 more vector load + FMA instructions inside the loop.


Once you get a first working version, one possible micro-optimization is to manually unroll the inner loop by 4: use _mm256_broadcast_ps to load 4 vector elements with a single instruction, then _mm256_permute_ps with the constants _MM_SHUFFLE(i,i,i,i), where i is in the [0..3] interval, to broadcast each of the 4 loaded vector elements into another vector.
After each permute, multiply/accumulate the broadcast vector with the current column of the matrix, and advance the matrix pointer to the next column.

For the last 1-3 elements of the vector, use the current version with _mm256_broadcast_ss.

The _mm256_permute_ps instruction is cheap, and most importantly, on AMD it can run on vector port #2 (on Zen 4 also on port #3); these ports aren't used by anything else in that loop.

I'm not 100% sure this is going to help, because there are more instructions to decode and dispatch. But it might, because it saves some memory loads.
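
My reading of that micro-optimization, as an untested sketch (a drop-in variant of multiplyInner_avx above; the macro is only there to keep the snippet short):

#include <immintrin.h>
#include <cstddef>

// Untested sketch of the unrolled version: same panel layout as multiplyInner_avx,
// but the vector is consumed 4 elements at a time with one 128-bit load per group.
void multiplyInner_avx_x4( const float* mat, const float* vec, size_t width, float* rdi )
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

// Multiply/accumulate one column of the 32-tall panel by a broadcast vector element
#define COLUMN( broadcastExpr )                                          \
    {                                                                    \
        const __m256 b = ( broadcastExpr );                              \
        acc0 = _mm256_fmadd_ps( b, _mm256_load_ps( mat ), acc0 );        \
        acc1 = _mm256_fmadd_ps( b, _mm256_load_ps( mat + 8 ), acc1 );    \
        acc2 = _mm256_fmadd_ps( b, _mm256_load_ps( mat + 16 ), acc2 );   \
        acc3 = _mm256_fmadd_ps( b, _mm256_load_ps( mat + 24 ), acc3 );   \
        mat += 32;                                                       \
    }

    size_t i = 0;
    for( ; i + 4 <= width; i += 4 )
    {
        // Load 4 vector elements at once, duplicated into both 128-bit halves of the register
        const __m256 v4 = _mm256_broadcast_ps( ( const __m128* )( vec + i ) );
        // _mm256_permute_ps with _MM_SHUFFLE(i,i,i,i) broadcasts element i of each half
        COLUMN( _mm256_permute_ps( v4, _MM_SHUFFLE( 0, 0, 0, 0 ) ) );
        COLUMN( _mm256_permute_ps( v4, _MM_SHUFFLE( 1, 1, 1, 1 ) ) );
        COLUMN( _mm256_permute_ps( v4, _MM_SHUFFLE( 2, 2, 2, 2 ) ) );
        COLUMN( _mm256_permute_ps( v4, _MM_SHUFFLE( 3, 3, 3, 3 ) ) );
    }
    // The remaining 1-3 elements use the plain scalar broadcast
    for( ; i < width; i++ )
    {
        COLUMN( _mm256_broadcast_ss( vec + i ) );
    }
#undef COLUMN

    _mm256_store_ps( rdi, acc0 );
    _mm256_store_ps( rdi + 8, acc1 );
    _mm256_store_ps( rdi + 16, acc2 );
    _mm256_store_ps( rdi + 24, acc3 );
}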

Soonts
  • 20,079
  • 9
  • 57
  • 130
  • Great, thanks! I think I get the idea: multiply and accumulate each matrix column with the vector elements in parallel along the rows of the matrix (both data parallel (SIMD) and task parallel (threads)), that is, column after column, with a reordered matrix such that the matrix data is contiguous for this access sequence (and well aligned and correspondingly local in the L3 cache for the threads). Makes sense to me and sounds like this could be most efficient. I'll implement that and report. – northwindow Feb 25 '23 at 22:23
  • 1
    @northwindow Right, the main performance wins of this method are sequential vectorized loads from the matrix, and the fact that every loaded element from the vector is reused to compute 32 output elements. Your vector only uses ~14 KB of memory (as opposed to the ~47 MB matrix), so it's OK to duplicate the vector into the L1d/L2 caches of every core. – Soonts Feb 26 '23 at 14:22
  • 2
    Soonts, based on your original function, I wrote an implementation for a 2048x2048 matrix and 64 threads (to start with) using my own threads that is about 20% faster than MKL for the same matrix size and number of threads :-) So it looks promising and I'll generalize it. Also, it doesn't have this problem https://stackoverflow.com/questions/75548704/intel-mkl-multi-threaded-matrix-vector-multiplication-sgemv-slow-after-little?noredirect=1#comment133301932_75548704 :-) This alone is huge benefit. – northwindow Feb 28 '23 at 02:18
  • @northwindow Nice. However, if you have that many hardware threads, the 32-tall panels might result in too few panels in the matrix, not enough to evenly split the work across threads. You could try reducing panel height by half, and the following function for the inner loop: https://gist.github.com/Const-me/2b383e9129c2f343fb874c56055d189e – Soonts Feb 28 '23 at 13:18
  • One more note, if you’re on Windows, consider the built-in thread pool. Example there: https://github.com/Const-me/Whisper/blob/master/Whisper/Utils/parallelFor.h and the corresponding parallelFor.cpp source file in the same folder. I don’t really know how they implemented waits, but maybe with GetQueuedCompletionStatus kernel API. – Soonts Feb 28 '23 at 13:24
  • Soonts, nice! This is even faster for the 2048x2048 matrix on 64 cores. MKL averages about 12.5 us over 1,000,000 multiplications, and your latest version is down to about 9.5 us. – northwindow Mar 01 '23 at 01:01
  • @northwindow Try another version. Same panel height 16, unrolled the loop more, might help: https://gist.github.com/Const-me/15111c4f9502b001eef31ffde4aa7770 BTW, note all these versions are producing slightly different results due to different summation order, and FP precision shenanigans. MKL probably uses yet another one. – Soonts Mar 01 '23 at 12:21
  • Soonts, very nice! Down to 9.0 us! I'm not observing statistically significant differences between ICC 2021.8, GCC 10.2.1, Clang 11.0, and AOCC 4.0 (clang 14). – northwindow Mar 01 '23 at 19:52
  • Soonts, could you explain why using more accumulators is beneficial? Thanks! – northwindow Mar 01 '23 at 19:54
  • Btw, I get the same performance if I split the 64 threads over two sockets. This is great, and also something that I didn't get with MKL or OpenBLAS and always wondered why. – northwindow Mar 01 '23 at 20:04
  • @northwindow About the accumulators and unrolling, I think the main reason is pipelining. A processor core runs many instructions in parallel, but it can’t start computing `a+=b*c;` before it got the previous value of the `a`. Modern AMD processors can complete 2 of these FMAs every cycle in each core, but the latency is 4-5 cycles in the best case, 5 cycles on Zen2, 4 cycles on the newer ones. When you have multiple independent accumulators like `a0 += b*c0; a1 += b*c1; a2 += b*c2; a3 += b*c3;` these 4 instructions don’t have data dependencies between them, all 4 of them will run in parallel. – Soonts Mar 01 '23 at 20:54
  • About the sockets, I don’t know the reason. I have limited experience with MKL, and I don’t know how specifically they parallelize that matrix*vector multiplication in their library. – Soonts Mar 01 '23 at 20:58
  • Soonts, I'm not quite understanding how to interpret the picture with the ordering above. For the `multiplyInner_avx()` code, the matrix would need to be stored in column-major format (including padding if needed), isn't that correct? However, this isn't how I'm reading the picture. What am I confusing? – northwindow Mar 03 '23 at 21:12
  • @northwindow The matrix needs to be reordered into a sequence of panels, of height 16. Each panel is indeed a column-major matrix of size width*16, but the complete matrix composed from a sequence of these panels is neither row major nor column major. On the picture the panel height is 4, otherwise the illustration would be too large to be readable, and the illustration has 2 of these panels. And yes, the last panel of the matrix indeed very likely to require zero padding, unless the height of the matrix is a multiple of 16. – Soonts Mar 03 '23 at 21:24
  • One more thing, I’m not sure how specifically MKL treats input matrices for your product. It could be you need to transpose the matrix before producing the panels, possible to do on the fly while reshaping. – Soonts Mar 03 '23 at 21:28
  • Ah, okay, so your two rows of panels are for two different threads? That's what I implemented. Each thread gets a column-major re-ordered portion of the original matrix. – northwindow Mar 03 '23 at 21:34
  • @northwindow Yeah, after you reshape the matrix that way, assign continuous subsets of these panels to different OS threads, to parallelize things. If your matrix is so small that it only has 2 panels, you’ll only be able to leverage 2 threads. – Soonts Mar 03 '23 at 21:40
  • I won't be using MKL in this application, just as a reference benchmark, but your intrinsics code (not sure which one yet). Using my own threads, etc, no MKL, or OpenMP. – northwindow Mar 03 '23 at 21:40
  • @northwindow Different libraries are using different conventions. When I wrote my answer, I assumed you want a vector with elements equal to dot products of input vector and matrix rows. It could be the opposite, dot products of input vector and matrix columns (some libraries are doing that when you ask them to multiply vector\*matrix, as opposed of matrix\*vector). If that’s the case my answer’s still good, but you need different reshaping function for the matrix. – Soonts Mar 03 '23 at 21:49
  • Your assumption was correct – northwindow Mar 03 '23 at 21:52
  • Added F16C and AVX-512 versions. – northwindow Mar 13 '23 at 22:02
1

Here's a multi-threaded SIMD 32-bit MVM implementation, based on @Soonts' suggestions, that performs somewhat decently on an AMD 7742.

https://gist.github.com/eastwindow/022668c9b495d932dc49dd8c45c81a7d

4096 x 4096 matrix: 30 us

3456 x 3456 matrix: 23 us

2048 x 2048 matrix: 7 us

1632 x 1632 matrix: 5.5 us

Synchronization overhead of 64 threads (no MVM): 2 us

All timings are averages of 1,000,000 repetitions with no break between repetitions.

However, with a break between repetitions, e.g. setting wait = 200000; and removing multiplyInner_omp_simd_a() in void stay_busy( int threadnr ), the MVMs take a lot longer. Adding _mm_pause() doesn't help, and neither do other things I tried to keep the AVX units busy.

Is there another way to keep the MVM fast after a break other than doing the MVM continuously?

(The system is well tuned for low latency and determinism. The cores on the socket are completely isolated, and all power-saving features in the CPU as well as frequency up- and down-scaling are disabled.)

Another thing I find noteworthy is the synchronization with atomics. If each atomic is on its own cache line, the synchronization takes the longest (4 us). Using 4 atomics per cache line results in the lowest overhead, 2 us, despite the false sharing. It seems as if the cache-coherence protocol is handled faster than loading from 64 separate cache lines.
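
To make the comparison concrete, here is a hypothetical illustration of the two flag layouts (the names are mine, not from the gist):

#include <atomic>
#include <cstdint>

// Hypothetical illustration of the two layouts compared above.

// Layout A: one completion flag per cache line, no false sharing,
// but the waiting thread has to read 64 different cache lines.
struct alignas( 64 ) PaddedFlag
{
    std::atomic<uint32_t> done;
};
static PaddedFlag flagsA[ 64 ];

// Layout B: 4 flags share one 64-byte cache line (false sharing within each group of 4),
// but the waiting thread only has to read 16 cache lines.
struct alignas( 64 ) FlagGroup
{
    std::atomic<uint32_t> done[ 4 ];
};
static FlagGroup flagsB[ 16 ];

// Worker thread i signals completion of one multiplication, e.g. with a generation counter:
//   flagsA[ i ].done.store( generation, std::memory_order_release );
// or, for layout B:
//   flagsB[ i / 4 ].done[ i % 4 ].store( generation, std::memory_order_release );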

northwindow
  • 178
  • 8