
I'm experimenting with locking data on Windows vs Linux.

The code I'm using for testing looks something like this:

#include <mutex>
#include <memory>   // for shared_ptr
#include <time.h>
#include <iostream>
#include <vector>
#include <thread>

using namespace std;

mutex m; 

unsigned long long dd = 0; 

void RunTest() 
{ 
    for(int i = 0; i < 100000000; i++) 
    {
        unique_lock<mutex> lck{m}; 
        //boost::mutex::scoped_lock guard(m1);
        dd++;
    }
}


int main(int argc, char *argv[]) 
{ 

    clock_t tStart = clock(); 
    int tCount = 0; 
    vector<shared_ptr<thread>> threads;
    for(int i = 0; i < 10;i++) 
    {
        threads.push_back(shared_ptr<thread>{new thread(RunTest)}); 
    }

    RunTest();    

    for(auto t:threads) 
    {
        t->join();
    }

    cout << ((double)(clock() - tStart)/CLOCKS_PER_SEC) << endl;

    return 0;
}

I'm comparing g++ -O3 against Visual Studio 2013 in release mode.

When I use unique_lock<mutex> for sync, Linux beats Windows in most scenarios, sometimes significantly.

But when I use Windows' CRITICAL_SECTION, the situation reverses, and the Windows code becomes much faster than the Linux one, especially as the thread count increases.

Here's the code I'm using for the Windows critical section test:

#include "stdafx.h"
#include <mutex>
#include <time.h>
#include <iostream>
//#include <boost/mutex>
#include <vector>
#include <thread>
#include <memory>

#include <Windows.h>

using namespace std;


mutex m;

unsigned long long dd = 0;

CRITICAL_SECTION critSec;

void RunTest()
{
    for (int i = 0; i < 100000000; i++)
    {
        //unique_lock<mutex> lck{ m };
        EnterCriticalSection(&critSec);
        dd++;
        LeaveCriticalSection(&critSec);
    }
}


int _tmain(int argc, _TCHAR* argv[])
{ 
    InitializeCriticalSection(&critSec);

    clock_t tStart = clock();
    int tCount = 0;
    vector<shared_ptr<thread>> threads;
    for (int i = 0; i < 10; i++)
    {
        threads.push_back(shared_ptr<thread>{new thread(RunTest)});
    }

    RunTest();

    for (auto t : threads)
    {
        t->join();
    }

    cout << ((double)(clock() - tStart) / CLOCKS_PER_SEC) << endl;
    DeleteCriticalSection(&critSec);

    return 0;
}

The way I understand it, this happens because critical sections are process-specific, so in the uncontended case they can be acquired without a kernel transition.

Most of the synchronization I'll be doing will be within a single process.

Is there anything on Linux that is faster than a mutex or Windows' critical section?

Arsen Zahray
  • You must find something other than clock() to benchmark with, as clock() works wildly differently on Windows and Linux. (On Linux it measures CPU time spent by the process, which depends on the number of cores utilized; on Windows clock() gives you wall-clock time.) – nos Mar 27 '14 at 14:07

1 Answer


First, your code has a huge contention problem (every thread hammers a single lock), so it does not reflect any sane situation and is not suitable as a benchmark. Most mutex implementations are optimized for the case where a lock can be acquired without waiting; in the other case, i.e. high contention, which involves blocking a thread, the overhead of the mutex implementation itself becomes insignificant, and you should redesign the system to get a decent improvement: split the work across multiple locks, use a lock-free algorithm, or use transactional memory (available as the TSX extension on some Haswell processors, or as a software implementation).

Now, to explain the difference: a CRITICAL_SECTION on Windows actually spins for a short time before falling back to a thread-blocking wait. Since blocking a thread costs orders of magnitude more, in a low-contention situation a short spin can greatly reduce the chance of paying that cost. (Note that in a high-contention situation a spinlock actually makes things worse.)

On Linux, you may want to look into the fast userspace mutex, or futex, which adopts a similar idea.
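For the specific workload in the question — many threads incrementing one shared counter — the lock can be removed entirely with an atomic increment (the intrinsics __sync_fetch_and_add and _InterlockedIncrement do the same thing); in portable C++11 that is std::atomic. A sketch of the benchmark's inner loop rewritten this way:

```cpp
#include <atomic>

// Lock-free counter: a hardware atomic add replaces the mutex entirely.
std::atomic<unsigned long long> dd{0};

void RunTest()
{
    for (int i = 0; i < 100000000; i++)
        dd.fetch_add(1, std::memory_order_relaxed);  // one atomic instruction, no lock
}
```

The cache line holding dd still ping-pongs between cores under contention, but there is no lock acquisition, no system call, and no thread ever blocks.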

Non-maskable Interrupt
  • It's not sane, and in a production environment I wouldn't dream of writing code like that. But in my experience, locking accounts for a large chunk of wasted efficiency when writing multithreaded software, so I'm looking for a way to minimize that expense. Also, according to http://stackoverflow.com/questions/3786947/futex-based-locking-mechanism, Linux mutexes are implemented with futexes. – Arsen Zahray Mar 27 '14 at 15:31
  • In most situations you wouldn't worry about the mutex's performance, since you would reduce lock contention through good design, i.e. distributing the locks instead of using a single lock, or even a lock-free algorithm; if a lock cannot be acquired, you should expect a long delay anyway because the thread blocks. As a side note, if all you want is an atomic counter, you can use the processor's native atomic operations, like __sync_fetch_and_add (or _InterlockedIncrement in MSVC). – Non-maskable Interrupt Mar 28 '14 at 00:25
  • Spinlock: actually, I've tried to speed up the test with a pthread spinlock. It slowed it down significantly. – Arsen Zahray Mar 28 '14 at 17:49
  • I want to try implementing the code using a futex, but I can't find a working example of how to do this anywhere. Can you recommend something? – Arsen Zahray Mar 28 '14 at 18:29