Wait until a variable becomes zero

Question

I'm writing a multithreaded program that can execute some tasks in separate threads.

Some operations require waiting for them at the end of execution of my program. I've written simple guard for such "important" operations:

class CPendingOperationGuard final
{
public: 
    CPendingOperationGuard()
    {
        InterlockedIncrementAcquire( &m_ullCounter );
    }

    ~CPendingOperationGuard()
    {
        InterlockedDecrementAcquire( &m_ullCounter );
    }

    static bool WaitForAll( DWORD dwTimeOut )
    {
        // Here is a topic of my question
        // Return false on timeout
        // Return true if wait was successful
    }

private:
    static volatile ULONGLONG m_ullCounter;
};

Usage is simple:

void ImportantTask()
{
    CPendingOperationGuard guard;
    // Do work
}

// ...

void StopExecution()
{
    if(!CPendingOperationGuard::WaitForAll( 30000 )) {
        // Handle error
    }
}

The question is: how to effectively wait until a m_ullCounter becames zero or until timeout.

I have two ideas:

To launch this function in another separate thread and write WaitForSingleObject( hThread, dwTimeout ):
```
DWORD WINAPI WaitWorker( LPVOID )
{
    while(InterlockedCompareExchangeRelease( &m_ullCounter, 0, 0 ))
        ;
}
```
But it will "eat" almost 100% of CPU time - bad idea.
Second idea is to allow other threads to start:
```
DWORD WINAPI WaitWorker( LPVOID )
{
    while(InterlockedCompareExchangeRelease( &m_ullCounter, 0, 0 ))
        Sleep( 0 );
}
```
But it'll switch execution context into kernel mode and back - too expensive in may task. Bad idea too

The question is:
How to perform almost-zero-overhead waiting until my variable becames zero? Maybe without separate thread... The main condition is to support stopping of waiting by timeout.

Maybe someone can suggest completely another idea for my task - to wait for all registered operations (like in WinAPI's ThreadPools - its API has, for instance, WaitForThreadpoolWaitCallbacks to perform waiting for ALL registered tasks).

PS: it is not possible to rewrite my code with ThreadPool API :(

Is there a specific reason not to use the C++ threading and atomics library? You use `final`, so C++11 seems available. C++11 also has `std::condition_variable` which lets you wait idle until you are notified by another thread. — walnut, Apr 10 '20 at 18:54
You should open your C++ book to the chapter that explains how to use `std::mutex` and `std::condition_variable`, and read it. Using `volatile` accomplishes absolutely nothing useful towards your goal. This is not what `volatile` is for. — Sam Varshavchik, Apr 10 '20 at 18:55
@JesperJuhl That might be the case, but the WinAPI calls used here seem to require it: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/legacy/ms683618(v%3Dvs.85) — walnut, Apr 10 '20 at 19:00
@walnut - now all that needs to be done is to figure out how to use those specific Win API calls to implement "almost zero-overhead waiting", and then there will be an argument for using them with `volatile` objects. And if you can't, that means you can't use them, so `volatile`, again, will accomplish absolutely nothing. — Sam Varshavchik, Apr 10 '20 at 19:06
@walnut, yes, there is a reason to use WinAPI goods - the whole project is written with WinAPI and my project should match project's code style. — Georgy Firsov, Apr 10 '20 at 19:07
However, for now I don't understand, how to bind `std::condition_variable` with waiting for the variable to become zero :( — Georgy Firsov, Apr 10 '20 at 19:10
My suggestion would be continue using InterlockedIncrement/InterlockedDecrement and simply examine the return value from each decrement, when it reaches 0 set an event or some other notification (APC, custom IOCP packet, windows message). Do not attempt waiting on the variable itself. Remember, the return value from each InterlockedDecrement is the value after the decrement and that is guaranteed to be atomic. — SoronelHaetir, Apr 10 '20 at 19:56
@GeorgyFirsov *the whole project is written with WinAPI and my project should match project's code style* -- That is not a convincing argument to not use the C++11 memory model. If this code you're showing us is your own, and there are no other places where WinAPI **MultiThread** functions are used (not just WinAPI), then I don't see the issue. Also, if the code you're showing is isolated from the rest of the code base, again, I don't see an issue using C++11. — PaulMcKenzie, Apr 10 '20 at 20:04
@SoronelHaetir - *when it reaches 0 set an event or some other notification* - really it can many time reaches 0 - every time when no active operation. but after this new operation can begin. simply counter not enough here — RbMm, Apr 10 '20 at 22:28

Remy Lebeau · Accepted Answer · 2020-04-10T22:12:14.397

Have a look at the WaitOnAddress() and WakeByAddressSingle()/WakeByAddressAll() functions introduced in Windows 8.

For example:

class CPendingOperationGuard final
{
public: 
    CPendingOperationGuard()
    {
        InterlockedIncrementAcquire(&m_ullCounter);
        WakeByAddressAll(&m_ullCounter);
    }

    ~CPendingOperationGuard()
    {
        InterlockedDecrementAcquire(&m_ullCounter);
        WakeByAddressAll(&m_ullCounter);
    }

    static bool WaitForAll( DWORD dwTimeOut )
    {
        ULONGLONG Captured, Now, Deadline = GetTickCount64() + dwTimeOut;
        DWORD TimeRemaining;
        do
        {
            Captured = InterlockedExchangeAdd64((LONG64 volatile *)&m_ullCounter, 0);
            if (Captured == 0) return true;
            Now = GetTickCount64();
            if (Now >= Deadline) return false;
            TimeRemaining = static_cast<DWORD>(Deadline - Now);
        }
        while (WaitOnAddress(&m_ullCounter, &Captured, sizeof(ULONGLONG), TimeRemaining));
        return false;
    }

private:
    static volatile ULONGLONG m_ullCounter;
};

Raymond Chen wrote a series of blog articles about these functions:

WaitOnAddress lets you create a synchronization object out of any data variable, even a byte

Implementing a critical section in terms of WaitOnAddress

Spurious wakes, race conditions, and bogus FIFO claims: A peek behind the curtain of WaitOnAddress

Extending our critical section based on WaitOnAddress to support timeouts

Comparing WaitOnAddress with futexes (futexi? futexen?)

Creating a semaphore from WaitOnAddress

Creating a semaphore with a maximum count from WaitOnAddress

Creating a manual-reset event from WaitOnAddress

Creating an automatic-reset event from WaitOnAddress

A helper template function to wait for WaitOnAddress in a loop

this is not solution because wait when `m_ullCounter` became 0 not enough for task. assume we got that `m_ullCounter == 0` . and so what ? new operation can begin already after `m_ullCounter` became 0 — RbMm, Apr 11 '20 at 11:33
and as side note (oftopic) `InterlockedExchangeAdd64((LONG64 volatile *)&m_ullCounter, 0);` interlocked exist sense use only for ReadModifyWrite operation . it have no sense for read only, if processor can do read as atomic (aligned 32 bit integer atomic read on x86/x64 and think on any platform too) so simply read `m_ullCounter` is enough — RbMm, Apr 11 '20 at 12:37

RbMm · Answer 2 · 2020-04-11T12:24:20.610

you need for this task something like Run-Down Protection instead CPendingOperationGuard

before begin operation, you call ExAcquireRundownProtection and only if it return TRUE - begin execute operation. at the end you must call ExReleaseRundownProtection

so pattern must be next

if (ExAcquireRundownProtection(&RunRef)) {
    do_operation();
    ExReleaseRundownProtection(&RunRef);
}

when you want stop this process and wait for all active calls do_operation(); finished - you call ExWaitForRundownProtectionRelease (instead WaitWorker)

After ExWaitForRundownProtectionRelease is called, the ExAcquireRundownProtection routine will return FALSE (so new operations will not start after this). ExWaitForRundownProtectionRelease waits to return until all calls the ExReleaseRundownProtection routine to release the previously acquired run-down protection (so when all current(if exist) operation complete). When all outstanding accesses are completed, ExWaitForRundownProtectionRelease returns

unfortunately this api implemented by system only in kernel mode and no analog in user mode. however not hard implement such idea yourself

this is my example:

enum RundownState {
    v_complete = 0, v_init = 0x80000000
};

template<typename T>
class RundownProtection
{
    LONG _Value;

public:

    _NODISCARD BOOL IsRundownBegin()
    {
        return 0 <= _Value;
    }

    _NODISCARD BOOL AcquireRP()
    {
        LONG Value, NewValue;

        if (0 > (Value = _Value))
        {
            do 
            {
                NewValue = InterlockedCompareExchangeNoFence(&_Value, Value + 1, Value);

                if (NewValue == Value) return TRUE;

            } while (0 > (Value = NewValue));
        }

        return FALSE;
    }

    void ReleaseRP()
    {
        if (InterlockedDecrement(&_Value) == v_complete)
        {
            static_cast<T*>(this)->RundownCompleted();
        }
    }

    void Rundown_l()
    {
        InterlockedBitTestAndResetNoFence(&_Value, 31);
    }

    void Rundown()
    {
        if (AcquireRP())
        {
            Rundown_l();
            ReleaseRP();
        }
    }

    RundownProtection(RundownState Value = v_init) : _Value(Value)
    {
    }

    void Init()
    {
        _Value = v_init;
    }
};

///////////////////////////////////////////////////////////////

class OperationGuard : public RundownProtection<OperationGuard>
{
    friend RundownProtection<OperationGuard>;

    HANDLE _hEvent;

    void RundownCompleted()
    {
        SetEvent(_hEvent);
    }

public:

    OperationGuard() : _hEvent(0) {}

    ~OperationGuard() 
    {
        if (_hEvent)
        {
            CloseHandle(_hEvent);
        }
    }

    ULONG WaitComplete(ULONG dwMilliseconds = INFINITE)
    {
        return WaitForSingleObject(_hEvent, dwMilliseconds);
    }

    ULONG Init()
    {
        return (_hEvent = CreateEvent(0, 0, 0, 0)) ? NOERROR : GetLastError();
    }
} g_guard;

//////////////////////////////////////////////

ULONG CALLBACK PendingOperationThread(void*)
{
    while (g_guard.AcquireRP())
    {
        Sleep(1000);// do operation
        g_guard.ReleaseRP();
    }

    return 0;
}

void demo()
{
    if (g_guard.Init() == NOERROR)
    {
        if (HANDLE hThread = CreateThread(0, 0, PendingOperationThread, 0, 0, 0))
        {
            CloseHandle(hThread);
        }

        MessageBoxW(0, 0, L"UI Thread", MB_ICONINFORMATION|MB_OK);

        g_guard.Rundown();

        g_guard.WaitComplete();
    }
}

why simply wait when wait until a m_ullCounter became zero not enough

if we read 0 from m_ullCounter this mean only at this time no active operation. but pending operation can begin already after we check that m_ullCounter == 0 . we can use special flag (say bool g_bQuit) and set it. operation before begin check this flag and not begin if it true. but this anyway not enough

naive code:

//worker thread

if (!g_bQuit) // (1)
{
    // MessageBoxW(0, 0, L"simulate delay", MB_ICONWARNING);

    InterlockedIncrement(&g_ullCounter); // (4)
    // do operation
    InterlockedDecrement(&g_ullCounter); // (5)
}

// here we wait for all operation done

    g_bQuit = true; // (2)

    // wait on g_ullCounter == 0, how - not important
    while (g_ullCounter) continue; // (3)

pending operation checking g_bQuit flag (1) - it yet false, so it begin
worked thread is swapped (use MessageBox for simulate this)
we set g_bQuit = true; // (2)
we check/wait for g_ullCounter == 0, it 0 so we exit (3)
working thread wake (return from MessageBox) and increment g_ullCounter (4)

problem here that operation can use some resources which we already begin destroy after g_ullCounter == 0

this happens because check quit flag (g_Quit) and increment counter after this not atomic - can be a gap between them.

for correct solution we need atomic access to flag+counter. this and do rundown protection. for flag+counter used single LONG variable (32 bit) because we can do atomic access to it. 31 bits used for counter and 1 bits used for quit flag. windows solution use 0 bit for flag (1 mean quit) and [1..31] bits for counter. i use the [0..30] bits for counter and 31 bit for flag (0 mean quit). look for

Wait until a variable becomes zero

2 Answers2