
Even without contention, the scalability of std::mutex seems to be horrible. This is a case where every thread is guaranteed to use its own mutex. What is going on?

#include <mutex>
#include <thread>
#include <chrono>
#include <vector>
#include <numeric>
#include <cstdio>

void TestThread(bool *pbFinished, int* pResult)
{
    std::mutex mtx;
    for (; !*pbFinished; (*pResult)++)
    {
        mtx.lock();
        mtx.unlock();
    }
}

void Test(int coreCnt)
{
    const int ms = 3000;
    bool bFinished = false;
    std::vector<int> results(coreCnt);    
    std::vector<std::thread*> threads(coreCnt);

    for (int i = 0; i < coreCnt; i++)
        threads[i] = new std::thread(TestThread, &bFinished, &results[i]);

    std::this_thread::sleep_for(std::chrono::milliseconds(ms));

    bFinished = true;
    for (std::thread* pThread: threads)
        pThread->join();

    int sum = std::accumulate(results.begin(), results.end(), 0);
    printf("%d cores: %.03fm ops/sec\n", coreCnt, double(sum)/double(ms)/1000.);
}

int main(int argc, char** argv)
{
    for (int cores = 1; cores <= (int)std::thread::hardware_concurrency(); cores++)
        Test(cores);

    return 0;
}

Results on Windows are very bad:

1 cores: 15.696m ops/sec
2 cores: 12.294m ops/sec
3 cores: 17.134m ops/sec
4 cores: 9.680m ops/sec
5 cores: 13.012m ops/sec
6 cores: 21.142m ops/sec
7 cores: 18.936m ops/sec
8 cores: 18.507m ops/sec

Linux manages to be an even bigger loser:

1 cores: 46.525m ops/sec
2 cores: 15.089m ops/sec
3 cores: 15.105m ops/sec
4 cores: 14.822m ops/sec
5 cores: 14.519m ops/sec
6 cores: 14.544m ops/sec
7 cores: 13.996m ops/sec
8 cores: 13.869m ops/sec

I have also tried using TBB's reader/writer lock, and have even rolled my own.

  • Get rid of the false sharing by having each thread increment `*pResult` as it terminates and see if it makes the huge difference that I expect it will make. Your threads are fighting over the `results` vector. – David Schwartz Oct 10 '22 at 22:46
  • 3
  • Not that I think it is likely to make a difference, but your program has a data race. You should make `bFinished` a `std::atomic`. Also, you are not destroying your `std::thread` objects, so they may hold on to whatever system resources they use. (Again, I don't think that matters for your results.) Just put them directly into the vector, not pointers. – user17732522 Oct 10 '22 at 22:53
  • @DavidSchwartz Oops, good catch, thanks. I'm trying to create a reproducible example from a larger program with this issue; looks like I need to keep working on that. Getting rid of the false sharing fixes this test case (see the sketch after these comments). I wonder if there is false sharing elsewhere in the larger program; the results are similar to the false-sharing ones here. – Tyson Jacobs Oct 10 '22 at 22:54
  • 1
  • Another point of contention: you've got the first thread off to the races before the other threads are even instantiated. That might skew your results. Consider having all threads start looping together and ending together. You can use a semaphore or condition variable to synchronize the ready state of the threads. – selbie Oct 10 '22 at 22:54
  • 2
  • Also, very important: just because you invoked `this_thread::sleep_for(ms)` doesn't mean it slept for exactly that long. Hence, dividing results by `ms` isn't entirely correct. A better approach would be to check the timestamp before and after sleeping to determine how long your main thread actually slept, or, preferably, have each thread do this itself. Instead of 3 seconds, run for 15 seconds to eliminate some of the jitter. – selbie Oct 10 '22 at 22:56
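Putting the first two suggestions together, a minimal sketch of the fix (hypothetical code, not from the question) could look like the following: the flag becomes a `std::atomic<bool>` and each thread counts in a local variable, writing to the shared `results` vector only once when it exits.

#include <atomic>
#include <mutex>

// Hypothetical variant of TestThread incorporating the comments' suggestions:
// no data race on the flag, and no repeated writes to neighbouring elements of
// the shared results vector while the timed loop is running.
void TestThreadFixed(std::atomic<bool>* pbFinished, int* pResult)
{
    std::mutex mtx;
    int localCount = 0; // stays in this thread's stack/registers, no shared cache line
    while (!pbFinished->load(std::memory_order_relaxed))
    {
        mtx.lock();
        mtx.unlock();
        ++localCount;
    }
    *pResult = localCount; // single write to the shared vector, after the loop
}

In `Test`, `bFinished` would need to be declared as `std::atomic<bool>` and passed accordingly; `TestThreadFixed` is an illustrative name, not code from the question.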

1 Answer


I made my own variant of your test with the following changes:

  • Each test thread executes for a specific number of iterations instead of for a specific amount of time. Each thread returns how long it took to run that many iterations. (For testing, I used 20 million iterations.)

  • The main thread orchestrating the test waits for each thread to "signal" that it's ready to begin. Then the main thread, upon seeing that all threads are "ready", signals "go" to all of them. These signals are basically condition_variables. This eliminates the performance noise of having one thread already executing while another is still being warmed up by the kernel.

  • No thread touches a shared variable until it exits and returns its result.

  • When all the threads have finished, the aggregate rate is computed from each thread's iteration count and the time that thread took.

  • Used the high-resolution clock to measure the time spent in each thread.

#include <mutex>
#include <condition_variable>
#include <thread>
#include <chrono>
#include <vector>
#include <iostream>

// Simple one-shot gate: Wait() blocks until Signal() sets isReady.
struct TestSignal
{
    std::mutex mut;
    std::condition_variable cv;
    bool isReady;

    TestSignal() : isReady(false)
    {

    }

    void Signal()
    {
        mut.lock();
        isReady = true;
        mut.unlock();
        cv.notify_all();
    }

    void Wait()
    {
        std::unique_lock<std::mutex> lck(mut);
        cv.wait(lck, [this] {return isReady; });
    }
};

long long TestThread2(long long iterations, TestSignal& signalReady, TestSignal& signalGo)
{
    std::mutex mtx;

    signalReady.Signal(); // signal to the main thread we're ready to proceed
    signalGo.Wait();      // wait for the main thread to tell us to start

    auto start = std::chrono::high_resolution_clock::now();

    for (long long i = 0; i < iterations; i++)
    {
        mtx.lock();
        mtx.unlock();
    }

    auto end = std::chrono::high_resolution_clock::now();

    auto milli = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    return milli.count(); // return how long it took to execute the iterations
}


void Test2(unsigned int threadcount)
{

    long long iterations = 20000000; // 20 million

    std::vector<std::thread> threads(threadcount);
    std::vector<TestSignal> readySignals(threadcount);
    std::vector<long long> results(threadcount);

    TestSignal signalGo;

    for (unsigned int i = 0; i < threadcount; i++)
    {
        auto t = std::thread([&results, &readySignals, &signalGo, i, iterations] {results[i] = TestThread2(iterations, readySignals[i], signalGo); });
        readySignals[i].Wait();
        threads[i] = std::move(t);
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(500));

    signalGo.Signal(); // unleash the threads

    for (unsigned int i = 0; i < threadcount; i++)
    {
        threads[i].join();
    }

    double totalrate = 0;
    for (unsigned int i = 0; i < threadcount; i++)
    {
        double rate = iterations / (double)(results[i]); // operations per millisecond
        totalrate += rate;
    }

    // totalrate is a sum of per-thread rates in operations per millisecond;
    // dividing by 1000 converts it to millions of operations per second
    std::cout << threadcount << " threads: " << totalrate / 1000 << "m ops/sec (new test)\n";
}

Then a simple main to compare both tests three times:

#ifdef _WIN32
#include <windows.h> // for SetPriorityClass / GetCurrentProcess
#endif

int main()
{
#ifdef _WIN32
    ::SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
#endif

    Test(std::thread::hardware_concurrency());
    Test2(std::thread::hardware_concurrency());

    Test(std::thread::hardware_concurrency());
    Test2(std::thread::hardware_concurrency());

    Test(std::thread::hardware_concurrency());
    Test2(std::thread::hardware_concurrency());


    return 0;
}

The results are noticeably different:

12 cores: 66.343m ops/sec
12 threads: 482.187m ops/sec (new test)
12 cores: 111.061m ops/sec
12 threads: 474.199m ops/sec (new test)
12 cores: 66.758m ops/sec
12 threads: 481.353m ops/sec (new test)
selbie