
I have a threading issue under Windows.

I am developing a program that runs complex physical simulations for different conditions, say one condition per hour of the year, which would be 8760 simulations. I am grouping those simulations per thread, such that each thread runs a for loop of 273 simulations on average.

I bought an AMD Ryzen 9 5950X with 16 cores (32 threads) for this task. On Linux, all the threads sit between 98% and 100% usage, while under Windows I get this:

(Image: Concurrency Visualizer timeline. The first bar is the I/O thread reading data; the smaller bars are the worker threads. Red: synchronization, green: processing, purple: I/O.)

This is from Visual Studio's Concurrency Visualizer, which tells me that 63% of the time was spent on thread synchronization. As far as I can tell, my code is the same for both the Linux and Windows runs.

I did my best to make the objects immutable to avoid issues, and that provided a big gain on my old 8-thread Intel i7. With many more threads, however, this issue arises.

For threading, I have tried both a custom parallel for and the Taskflow library. Both perform identically for what I want to do.

Is there something fundamental about Windows threads that produces this behaviour?

The custom parallel for code:


    #include <algorithm>   // std::min, std::max
    #include <cmath>       // std::round
    #include <thread>
    #include <vector>

    /**
     * parallel for
     * @tparam Index integer type
     * @tparam Callable function type
     * @param start start index of the loop
     * @param end final +1 index of the loop
     * @param func function to evaluate
     * @param nb_threads number of threads, if zero, it is determined automatically
     */
    template<typename Index, typename Callable>
    static void ParallelFor(Index start, Index end, Callable func, unsigned nb_threads=0) {

        // Estimate number of threads in the pool
        if (nb_threads == 0) nb_threads = getThreadNumber();

        // Size of a slice for the range functions
        Index n = end - start;  // 'end' is exclusive (one past the last index)
        Index slice = (Index) std::round(n / static_cast<double> (nb_threads));
        slice = std::max(slice, Index(1));

        // [Helper] Inner loop
        auto launchRange = [&func] (Index k1, Index k2) {
            for (Index k = k1; k < k2; k++) {
                func(k);
            }
        };

        // Create pool and launch jobs
        std::vector<std::thread> pool;
        pool.reserve(nb_threads);
        Index i1 = start;
        Index i2 = std::min(start + slice, end);

        for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
            pool.emplace_back(launchRange, i1, i2);
            i1 = i2;
            i2 = std::min(i2 + slice, end);
        }

        if (i1 < end) {
            pool.emplace_back(launchRange, i1, end);
        }

        // Wait for jobs to finish
        for (std::thread &t : pool) {
            if (t.joinable()) {
                t.join();
            }
        }
    }
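
For illustration, a minimal call (with a hypothetical per-index body) would look like this:

    // Hypothetical usage: distribute 8760 iterations over the available hardware threads.
    ParallelFor(0, 8760, [](int t) {
        // ... run the simulation for hour t ...
    });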

A complete C++ project illustrating the issue is uploaded here: https://github.com/SanPen/parallel_testing_cpp

Main.cpp:

    //
    // Created by santi on 26/08/2022.
    //
    #include <iostream>
    #include <vector>

    #include "input_data.h"
    #include "output_data.h"
    #include "random.h"
    #include "par_for.h"

    void fillA(Matrix& A){

        Random rnd;
        rnd.setTimeBasedSeed();

        for(int i=0; i < A.getRows(); ++i)
            for(int j=0; j < A.getCols(); ++j)
                A(i, j) = (int) rnd.randInt(0, 1000);
    }


    void worker(const InputData& input_data,
                OutputData& output_data,
                const std::vector<int>& time_indices,
                int thread_index){

        std::cout << "Thread " << thread_index << " [" << time_indices[0] << ", " << time_indices[time_indices.size() - 1] << "]\n";

        for(const int& t: time_indices){

            Matrix b = input_data.getAt(t);

            Matrix A(input_data.getDim(), input_data.getDim());
            fillA(A);

            Matrix x = A * b;

            output_data.setAt(t, x);
        }
    }


    void process(int time_steps, int dim, int n_threads){
        InputData input_data(time_steps, dim);
        OutputData output_data(time_steps, dim);

        // correct the number of threads
        if ( n_threads < 1 ) { n_threads = ( int )getThreadNumber( ); }

        // generate indices
        std::vector<int> time_indices = arrange<int>(time_steps);

        // compute the split of indices per core
        std::vector<ParallelChunkData<int>> chunks = prepareParallelChunks(time_indices, n_threads );

        // run in parallel
        ParallelFor( 0, ( int )chunks.size( ), [ & ]( int k ) {
            // run chunk
            worker(input_data, output_data, chunks[k].indices, k );
        } );
    }

    int main(){

        process(8760, 5000, 0);

        return 0;
    }
Santi Peñate-Vera
  • There is no synchronization shown here, except for `join`, which should only affect the calling thread. If there is a problem, it is probably in whatever `func` you are passing. – François Andrieux Aug 25 '22 at 19:57
  • What does `func` do? At first sight, it looks like the "user" code in `func` might simply be accessing constructs that block on resources from other threads: paging memory pages in, waiting on mutexes, I/O from devices, etc. Without seeing what `func` does, or without replacing `func` with a simple computation, it's hard to be conclusive. (Yeah, just what François said, faster :-)) – Jeffrey Aug 25 '22 at 20:01
  • I'll try to reproduce something that I can post. Conceptually, I am reading data arrays stored in RAM (in the main thread, I guess), doing calculations (in each thread) and storing the results in 2D arrays that I guess are hosted in the main thread. Everything is pre-allocated. @FrançoisAndrieux, indeed there is no explicit synchronization anywhere. Should there be? – Santi Peñate-Vera Aug 25 '22 at 21:25
  • @SantiPeñate-Vera You are asking about synchronization, so I would assume your code would contain synchronization. – François Andrieux Aug 25 '22 at 22:48
  • With my limited knowledge, I assumed all that was taken care of automatically. In every tutorial, the most I see is the use of a mutex when adding to a vector or something similar. Should I use a mutex every time I access another thread's resource, even if the access is guaranteed not to collide? – Santi Peñate-Vera Aug 26 '22 at 06:41
  • I modified the question to add code to illustrate it. I uploaded the whole code here https://github.com/SanPen/parallel_testing_cpp/blob/main/main.cpp – Santi Peñate-Vera Aug 26 '22 at 13:58
  • Try std::thread static subscription to cores with [ThreadAffinity](https://stackoverflow.com/questions/56486588/setting-thread-affinity-of-stdthreads). It will remove some cache-related uncertainties. Maybe look for some clues in the Windows thread API. – Bartosz Charuza Aug 28 '22 at 17:24
  • Also, have you attempted to use `std::for_each(std::execution::par_unseq, ...)`? That may yield better results than doing this yourself, because it's more likely you're running into overhead creating threads due to how the standard says `std::thread` works. The stdlib can make use of things like the Windows thread pool to make this less costly. Threads on Windows are very expensive (for a variety of reasons, reasons Linus regrets not implementing), so they are intended to last for a while. – Mgetz Aug 29 '22 at 16:40
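
A rough sketch of that suggestion (assuming a C++17 compiler with parallel algorithms support; `processTimestep` is a hypothetical stand-in for the per-simulation work):

    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>

    // Hypothetical per-timestep work; stands in for the simulation body.
    void processTimestep(int /*t*/) { /* ... simulation for one hour ... */ }

    void runAll(int time_steps) {
        std::vector<int> indices(time_steps);
        std::iota(indices.begin(), indices.end(), 0);  // 0, 1, ..., time_steps-1
        // The standard library schedules the iterations across threads;
        // on MSVC this is backed by the Windows thread pool.
        std::for_each(std::execution::par_unseq, indices.begin(), indices.end(),
                      [](int t) { processTimestep(t); });
    }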

2 Answers


The performance problem you see is definitely caused by the many memory allocations, as already suspected by Matt in his answer. To expand on this, here is a screenshot from Intel VTune running on an AMD Ryzen Threadripper 3990X with 64 cores (128 threads): (VTune image)

As you can see, almost all of the time is spent in malloc or free, which get called from the various Matrix operations. The bottom part of the image shows the timeline of the activity for a small selection of the threads: green means that the thread is inactive, i.e. waiting. Usually only one or two threads are actually active. Allocating and freeing memory accesses a shared resource, causing the threads to wait for each other.

I think you have only two real options:

Option 1: No dynamic allocations anymore

The most efficient thing to do would be to rewrite the code to preallocate everything and get rid of all the temporaries. To adapt this to your example code, you could replace `b = input_data.getAt(t);` and `x = A * b;` like this:

    void MatrixVectorProduct(Matrix const & A, Matrix const & b, Matrix & x)
    {
      for (int i = 0; i < x.getRows(); ++i) {
        for (int j = 0; j < x.getCols(); ++j) {
          x(i, j) = 0.0;
          for (int k = 0; k < A.getCols(); ++k) {
            x(i,j) += (A(i,k) * b(k,j));
          }
        }
      }
    }


    void getAt(int t, Matrix const & input_data, Matrix & b) {
      for (int i = 0; i < input_data.getRows(); ++i)
        b(i, 0) = input_data(i, t);
    }


    void worker(const InputData& input_data,
                OutputData& output_data,
                const std::vector<int>& time_indices,
                int thread_index){

        std::cout << "Thread " << thread_index << " [" << time_indices[0] << ", " << time_indices[time_indices.size() - 1] << "]\n";

        // Allocated once per thread and reused for every time step.
        Matrix A(input_data.getDim(), input_data.getDim());
        Matrix b(input_data.getDim(), 1);
        Matrix x(input_data.getDim(), 1);

        for (const int & t: time_indices) {
          getAt(t, input_data.getMat(), b);
          fillA(A);
          MatrixVectorProduct(A, b, x);
          output_data.setAt(t, x);
        }

        std::cout << "Thread " << thread_index << ": Finished" << std::endl;
    }

This fixes the performance problems. Here is a screenshot from VTune showing the much better utilization: (VTune image)

Option 2: Using a special allocator

The alternative is to use a different allocator that handles allocating and freeing memory more efficiently in multithreaded scenarios. One that I have had very good experience with is mimalloc (there are others, such as Hoard or the one from TBB). You do not need to modify your source code; you just need to link with a specific library, as described in the documentation.

I tried mimalloc with your source code, and it gave near 100% CPU utilization without any code changes. I also found a post on the Intel forums with a similar problem, and the solution there was the same (using a special allocator).
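
For reference, a minimal sketch of the source-level opt-in route (based on the `mimalloc-new-delete.h` header shipped with mimalloc; the dynamic-override route described in the docs needs no code changes at all):

    // Include in exactly ONE translation unit, then link against the mimalloc
    // library: this replaces the global new/delete operators, so every Matrix
    // allocation goes through mimalloc instead of the default heap.
    #include <mimalloc-new-delete.h>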

Additional notes

  • Matrix::allocSpace() allocates the memory by using pointers to arrays. It is better to use one contiguous array for the whole matrix instead of multiple independent arrays (see the sketch after this list). That way, all elements are located next to each other in memory, allowing more efficient access.
  • In general, I suggest using a dedicated linear algebra library such as Eigen instead of the hand-rolled matrix implementation, to exploit vectorization (SSE2, AVX, ...) and to get the benefits of a highly optimized library.
  • Ensure that you compile your code with optimizations enabled.
  • Disable various cross-checks if you do not need them: assert() (i.e. define NDEBUG in the preprocessor), and for MSVC possibly /GS-.
  • Ensure that you actually have enough memory installed.
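
A minimal sketch of the contiguous-storage idea (illustrative only; this is not the Matrix class from the question):

    #include <cstddef>
    #include <vector>

    // One allocation for the whole matrix; elements are laid out row-major,
    // so neighbouring elements share cache lines.
    class DenseMatrix {
    public:
        DenseMatrix(int rows, int cols)
            : rows_(rows), cols_(cols),
              data_(static_cast<std::size_t>(rows) * cols) {}
        double& operator()(int i, int j)       { return data_[static_cast<std::size_t>(i) * cols_ + j]; }
        double  operator()(int i, int j) const { return data_[static_cast<std::size_t>(i) * cols_ + j]; }
        int getRows() const { return rows_; }
        int getCols() const { return cols_; }
    private:
        int rows_, cols_;
        std::vector<double> data_;
    };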
Sedenion
  • In this concrete case, the usage of dedicated libraries like Eigen is prescribed mostly not because of their manual vectorization, but rather their usage of [expression templates](https://en.wikipedia.org/wiki/Expression_templates), which minimize the amount of allocations. – Ave Milia Aug 29 '22 at 17:58
  • Fully agree; in the code of the bigger project I use Armadillo + MKL, but I wanted the dummy example to be self-contained. @Sedenion, the mimalloc route is probably the one I need to use in the real-life code. I also agree 100% with your first approach. – Santi Peñate-Vera Aug 29 '22 at 19:08
  • Mimalloc improves things quite a bit (for the example it is a silver bullet). – Santi Peñate-Vera Aug 29 '22 at 19:53

You said that all your memory was pre-allocated, but in the worker function I see this...

    Matrix b = input_data.getAt(t);

which allocates and fills a new matrix b, and this...

    Matrix A(input_data.getDim(), input_data.getDim());

which allocates and fills a new matrix A, and this...

    Matrix x = A * b;

which allocates and fills a new matrix x.

The heap is a global data structure, so the thread synchronization time you're seeing is probably contention in the memory allocate/free functions.

These are in a tight loop. You should fix this loop to access `b` by reference, and reuse the other two matrices on every iteration.
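
A sketch of that restructuring (`getAtInto` and `multiplyInto` are hypothetical helpers that fill pre-allocated matrices in place; they are not part of the posted code):

    // Allocate once per thread, outside the tight loop.
    Matrix A(dim, dim);
    Matrix b(dim, 1);
    Matrix x(dim, 1);

    for (const int& t : time_indices) {
        getAtInto(input_data, t, b);  // fill b in place instead of returning a new Matrix
        fillA(A);
        multiplyInto(A, b, x);        // write A*b into x without allocating a temporary
        output_data.setAt(t, x);
    }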

Matt Timmermans
  • How come that is not an issue under Linux? In general I cannot avoid the creation and destruction of arrays in the real program, since the task is a very complex physical simulation. – Santi Peñate-Vera Aug 28 '22 at 12:27
  • Also, I have just moved the matrix A, b, x declarations out of the loop and the performance stays equally bad. – Santi Peñate-Vera Aug 28 '22 at 12:31
  • On Linux you are probably using GCC, while under Windows probably Visual C++? They have different allocator implementations. – Matt Timmermans Aug 28 '22 at 12:32
  • Under Windows I have tested both VS2022 and GCC via MinGW. Equally bad. – Santi Peñate-Vera Aug 28 '22 at 12:34
  • Hmmm... I don't know if the Windows version of the GCC allocator has all the same optimizations for concurrent use. I also can't say that you're comparing equivalent metrics between the two OSes. In any case, just moving the declarations is not going to fix the problem. Are all the input_data matrices the same size? – Matt Timmermans Aug 28 '22 at 12:42
  • In the dummy example provided, yes, but in the main program I'm making there are dozens of structures, dense and sparse, with different sizes. In abstract terms, think about the worker as a black box that takes read-only data and writes data into a pre-allocated output, in such a way that no other thread can compete for the same bytes. Somehow the issue is with synchronization, but from my limited knowledge there should be no synchronization, since there is no competition for the resources (at least conceptually). Maybe I should be using a different library or something else prepared for these tasks. – Santi Peñate-Vera Aug 28 '22 at 12:47
  • Since they're the same size, you will be reusing `b` and `A`, but `*` and `setAt` also allocate new Matrices. In general, in writing code like this you should be careful not to allocate and free memory in tight loops. If you must, then a custom allocator can be used... but that's really not as good as just being careful. – Matt Timmermans Aug 28 '22 at 12:56
  • `HeapAlloc`, which is what the default allocator on Windows is, is thread specific, not process specific, unless you deliberately pass the default process heap handle. – Mgetz Aug 29 '22 at 16:33
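
For illustration, the distinction in the Win32 heap API (these are real functions; the handle passed to `HeapAlloc` decides which heap, and therefore which shared state, is touched):

    #include <windows.h>

    void heapDemo() {
        // A private heap, separate from the process-wide default heap.
        HANDLE priv = HeapCreate(0, 0, 0);  // options, initial size, max size (0 = growable)
        void* p = HeapAlloc(priv, 0, 1024);              // allocates from the private heap
        void* q = HeapAlloc(GetProcessHeap(), 0, 1024);  // allocates from the shared default heap
        HeapFree(GetProcessHeap(), 0, q);
        HeapFree(priv, 0, p);
        HeapDestroy(priv);
    }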