
Under WinAPI there is the WaitForSingleObject()/ReleaseMutex() function pair, and there is also the Interlocked*() function family. I decided to compare the performance of acquiring a single mutex against exchanging an interlocked variable.

HANDLE mutex = CreateMutex(NULL, FALSE, NULL); // create an unowned mutex
WaitForSingleObject(mutex, INFINITE);          // acquire
// .. critical section ..
ReleaseMutex(mutex);                           // release

// 0 = unlocked, 1 = locked
volatile LONG lock = 0;
while(InterlockedCompareExchange(&lock, 1, 0)) // spin until we swap 0 -> 1
  SwitchToThread();                            // yield while someone else holds it
// .. critical section ..
InterlockedExchange(&lock, 0);                 // release
SwitchToThread();

I've measured the performance of these two methods and found that using Interlocked*() is about 38% faster. Why is that?

Here's my performance test:

#include <windows.h>
#include <iostream>
#include <conio.h>
using namespace std;

volatile LONG interlocked_variable = 0; // 0 = unlocked, 1 = locked
volatile int run                   = 1; // main() clears this to stop the workers

DWORD WINAPI thread(LPVOID lpParam)
{
    while(run)
    {
        // spin until we atomically swap the lock from 0 (free) to 1 (held)
        while(InterlockedCompareExchange(&interlocked_variable, 1, 0))
            SwitchToThread();
        ++(*((unsigned int*)lpParam));                 // count one lock/unlock cycle
        InterlockedExchange(&interlocked_variable, 0); // release the lock
        SwitchToThread();
    }

    return 0;
}

int main()
{
    unsigned int num_threads;
    cout << "number of threads: ";
    cin >> num_threads;
    unsigned int* num_cycles = new unsigned int[num_threads];
    HANDLE* handles          = new HANDLE[num_threads];
    DWORD s_time, e_time;

    s_time = GetTickCount();
    for(unsigned int i = 0; i < num_threads; ++i)
    {
        num_cycles[i] = 0;
        handles[i] = CreateThread(NULL, 0, thread, &num_cycles[i], 0, NULL);
    }
    _getch();
    run = 0;
    e_time = GetTickCount();
    // join the workers so the counters are stable before we read them
    for(unsigned int i = 0; i < num_threads; ++i)
    {
        WaitForSingleObject(handles[i], INFINITE);
        CloseHandle(handles[i]);
    }

    unsigned long long total = 0;
    for(unsigned int i = 0; i < num_threads; ++i)
        total += num_cycles[i];
    for(unsigned int i = 0; i < num_threads; ++i)
        cout << "\nthread " << i << ":\t" << num_cycles[i] << " cyc\t" << ((double)num_cycles[i] / (double)total) * 100 << "%";
    cout << "\n----------------\n"
        << "cycles total:\t" << total
        << "\ntime elapsed:\t" << e_time - s_time << " ms"
        << "\n----------------"
        << '\n' << (double)(e_time - s_time) / (double)(total) << " ms/op\n";

    delete[] num_cycles;
    delete[] handles;
    _getch();
    return 0;
}
Ivars
  • A mutex is a kernel object for cross-process synch, so every lock/unlock involves a context switch. [See here](http://msdn.microsoft.com/en-us/magazine/cc163726.aspx) for a relevant discussion. – Roger Rowland Dec 13 '13 at 10:41
  • As far as I know, a Mutex is more suitable for synchronization between multiple processes. You might also want to try out a critical section. – Paul Dec 13 '13 at 10:42
  • How did you test performance? Was there any actual contention on the locks? – Martin James Dec 13 '13 at 11:43
  • Try a test with a lengthy CPU-intensive operation inside the lock that takes 100ms. Start 32 threads to get some contention going. See what happens. – Martin James Dec 13 '13 at 11:46
  • @MartinJames I've added my performance test source code. Inside the thread, replace Interlocked*() with WaitFor*() to see the WaitFor*() performance. First you enter the number of threads to create, then press enter, wait some time, and press enter again; you will then see the statistics. – Ivars Dec 13 '13 at 12:02
  • Martin, uncontested lock acquisition time is often also a useful metric... – Len Holgate Dec 13 '13 at 13:06
  • This spinlock implementation does one syscall per iteration, so it is a pretty silly implementation (the syscall is also the major performance factor in acquiring the mutex). It will also likely not perform well (in any case not predictably) in the presence of contention. If a spinlock is applicable, it shouldn't yield like this, and if it's not applicable, a different mechanism should be used. – Damon Dec 13 '13 at 13:55
  • @LenHolgate - of course, yes. I just wished to point out that it is not the only factor you should consider:) – Martin James Dec 13 '13 at 16:06

2 Answers


WaitForSingleObject does not have to be faster. It covers a much wider range of synchronization scenarios; in particular, you can wait on handles that do not "belong" to your process, which makes interprocess synchronization possible. Taking all this into consideration, it is only 38% slower according to your test.
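
For example (a minimal sketch; the mutex name "MyAppSharedMutex" is just an illustration), two unrelated processes can both run this code and serialize on the same lock, which no plain Interlocked*() variable can do:

#include <windows.h>

int main()
{
    // either process creates/opens the same named mutex;
    // the name is only illustrative
    HANDLE mutex = CreateMutexA(NULL, FALSE, "MyAppSharedMutex");
    if(mutex == NULL)
        return 1;

    WaitForSingleObject(mutex, INFINITE); // blocks even if another *process* holds it
    // .. use the shared resource (file, shared memory, ...) ..
    ReleaseMutex(mutex);

    CloseHandle(mutex);
    return 0;
}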

If you have everything inside your process and every nanosecond counts, InterlockedXxx might be a better option, but it is definitely not an absolutely superior one.

Additionally, you might want to look at the Slim Reader/Writer (SRW) Locks API. You could perhaps build a similar class or set of functions based purely on InterlockedXxx with slightly better performance; however, the point is that with SRW locks you get something ready to use out of the box, with documented behavior, stability, and decent performance.
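
For illustration, a minimal sketch of an SRW lock used for plain mutual exclusion (do_work is a placeholder name; exclusive mode only):

#include <windows.h>

SRWLOCK srw = SRWLOCK_INIT; // statically initialized; no destroy call needed

void do_work()
{
    AcquireSRWLockExclusive(&srw);
    // .. critical section ..
    ReleaseSRWLockExclusive(&srw);
}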

Roman R.

You are not comparing equivalent locks, so it's not surprising that the performance is so different.

A mutex allows for cross-process locking; it's likely one of the most expensive ways to lock because of the flexibility it provides. It will usually put your thread to sleep when you block on a lock, using no CPU until you are woken up after gaining the lock. This allows other code to use the CPU.

Your InterlockedCompareExchange() code is a simple spin lock. You will burn CPU waiting for your lock.
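
By way of contrast, here is a sketch of the usual spin-then-yield pattern, which spins in user mode (YieldProcessor() issues a CPU pause hint) before paying for a syscall; the spin count of 4000 is an arbitrary illustrative value, not a tuned one:

#include <windows.h>

volatile LONG lock = 0; // 0 = unlocked, 1 = locked

void spin_lock()
{
    while(InterlockedCompareExchange(&lock, 1, 0) != 0)
    {
        // spin briefly in user mode before making a syscall
        for(int i = 0; i < 4000 && lock != 0; ++i)
            YieldProcessor(); // pause hint, no kernel transition
        SwitchToThread();     // only then give up the time slice
    }
}

void spin_unlock()
{
    InterlockedExchange(&lock, 0);
}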

You might also want to look into Critical Sections (less overhead than a Mutex) and Slim Reader/Writer Locks (which can be used for mutual exclusion if you always obtain an exclusive lock and which provide fractionally faster performance than critical sections for non-contested use, according to my tests).
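
For reference, a minimal critical-section sketch (the function names are placeholders); EnterCriticalSection stays in user mode on the uncontended path and only waits in the kernel under contention:

#include <windows.h>

CRITICAL_SECTION cs;

void init()    { InitializeCriticalSection(&cs); }

void do_work()
{
    EnterCriticalSection(&cs);   // user-mode fast path when uncontended
    // .. critical section ..
    LeaveCriticalSection(&cs);
}

void cleanup() { DeleteCriticalSection(&cs); }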

You might also want to read Kenny Kerr's "The Evolution of Synchronization in Windows and C++" and Preshing's lock-related posts, here and here.

Len Holgate