-1

After 5 days working on this (3 major rewrites), I seek help from you wise people on StackOverflow. I’ve used atomic int(s) before, but not this extensively.

THE SCENARIO; I have batches, 1..1000+ batches, of test labeled A..Z. Batches set 2, cant be processed until it knows the test results processed of batch 1 and so on.

    1)  A B C..Z
    2)  A B C..Z
    .
    .
   99)
100++)  A B C..Z

STAGE3 (step three) – utilizes a global Atomic Int, to assign a local value index (fetch_add), instead of a “Do, while“ loop, looking for the next free “test” (removing the need for loops and looking for locks to be free on each test). It Works beautifully.

void Stage3(int ThreadsLevel)
{
    int index;
    CurrentWorkers.fetch_add(1); //Workers Entering
    index = IndexManager.fetch_add(1); //Assign Thread His Job (A..Z)
    do
    {
        //
        // BODY OF ALL THE WORK TO DO
        //

        index = IndexManager.fetch_add(1);
    } while (index < NumberOfTest);
    CurrentWorkers.fetch_sub(1); //Workers Exiting
}

STAGE2 (a gate,) stops any straggler threads (windows might borrow and return), from joining in if work almost done, and sends the Thread forward to begin the next batches (free flowing threads), once the final thread is done. I know exactly when the work is done, by using a Atomic Int (CurrentWorkers.fetch_add(1) on entry, and CurentWorkers.fetch_sub(1), in stage 3) on exit.

Uses; IndexManager.Load() < NumberOfText to see if all assigned , just waiting to complete. If so blocks it from stage 3, and sends forward.

Uses; (CurrentWorkers.Load() == 0 )
to Reset; IndexManager.load(0)

void Stage2(int ThreadsLevel)
{
    if (ThreadsLevel == CurrentLevel.Load()) // < Less than Straggler Go Wait to do Next batches
    {
        if (IndexManager.load() < NumberOfTest) //greater send thread to wait for next batch
        {
            Stage3(ThreadsLevel);

            if (CurrentWorkers.load() == 0)
            {
                if (IndexManager.load() > 0)
                    IndexManager.store(0);
            }
        }
    }
}

**** I demonstrated all the Atomic Variables in use, in case you think this is a compiler ordering problem or something.

Stage1 THE PROBLEM; When the Test begins, Threads are created and sent to a Waiting room (Stage1_Wait),to eliminate the costly overhead of “Creating” and “Ending Threads” for each iteration of (A…Z). It Waits for “TestReady.load()”.

The MAIN thread goes out and gets the samples (A..Z), sets a Atomic Int (SamplesReady”) to Free the Threads. For now, it waits for all the threads to return (ActiveThreads), before grabbing the next Batch.

void Stage1_Wait() //PROBLEM
{
    std::atomic_int testing;

    while (ProcessingTest.load())
    {
        if (TestReady.load()) 
        {
            Stage2(ActiveLevel); //ActiveLevel Global Variable passed by value

            SamplesReady.store(0);
            ActiveThreads.fetch_sub(1);
        }
    }
}

void MainThread_ProcessLevel()
{
    ActiveThreads.store(NumberOfThreads);       
    SamplesReady.store(1);
    int waiting = 1;
    while (waiting)
    {
        if (!ActiveThreads.load())
            waiting = 0;
    }
}

Works fine with 1 Single thread off the main thread. Add 2 Threads, and it gets over 100 test levels, before “locking up”. Add more threads and it “Locks up” within 20.

Definition LOCKING UP: ActiveThreads.fetch_sub(1) IS NOT decrementing down to zero. Threads are passing it but no updates on the “fetch_sub(1)”. Therefore the MainThreadProcessing is just Waiting. It never reached “0” is stuck.

After using various “memory_order_options” – I said okay, let the threads flow, and had MainThread instead check for “SamplesReady.load() == 0”. GUESS WHAT… “SamplesReady “ isn’t getting updated, BUT NOW “ActiveTHreads” Sees every thread and is decrementing (Feels like I’m being Pranked by my computer – LOL).

So the PROBLEM seems to be MainThreadProcessing and Stage1_Wait. Last night I did a total rewrite/redesign of everything (final is what you see now). It crept up on me again (same area). Any Ideas???

After 3 Days I rewrote/redesigned the code (3 time). I tried various memory ordering options (after scanning StackOverflow and other sites) and still the same results in the same area.

UPDATED / ADDED PER REQUEST - "BAR MIN" I Even removed my Memory Order test, so the default is the strongest memory order. COMMENTS DEMONSTRATE WHERE... FIXED!

#include <iostream>
#include <thread>
#include <atomic>

std::atomic_int ProcessingSamples;
std::atomic_int SamplesReady;
std::atomic_int ActiveThreads;
std::thread ThreadCalls[8];
int NumberOfThreads;

void PretendToProcessSamples()
{
    double StupidMath;
    StupidMath = 1.0;
    StupidMath *= sqrt(StupidMath);
    StupidMath += tanh(StupidMath);
}

void WaitingRoom()
{
    static int BreakPointInt = 0;
    std::atomic_int testing;

    BreakPointInt++;
    while (ProcessingSamples.load())
    {
        if (SamplesReady.load())
        {
            PretendToProcessSamples();
            SamplesReady.store(0);
            ActiveThreads.fetch_sub(1);
        }
    }
}

void ProcessSamples(int i) // THREADED
{
    int waiting;

    ActiveThreads.store(NumberOfThreads);
    waiting = 1;
    SamplesReady.store(1);
    while (waiting)
    {
        if (!ActiveThreads.load())
            waiting = 0;
    }
}

void StartBackGroundThreads(int NbrOfThreads)
{
    if (NbrOfThreads) // IF 0, do normal operations
    {
        NumberOfThreads = NbrOfThreads;
        ProcessingSamples.store(1);
        SamplesReady.store(0);
        for (int z = 0; z < NumberOfThreads; z++)
        {
            ThreadCalls[z] = std::thread(WaitingRoom);
        }
    }
}

int main()
{
    StartBackGroundThreads(4);

    //for (int b = 0; b < 1000; b++)
    for (int i = 0; i < 4000; i++)
        ProcessSamples(i);

    ProcessingSamples.store(0);
    for (int z = 0; z < NumberOfThreads; z++)
    {
        ThreadCalls[z].join();
    }
}
Adrian E
  • 29
  • 6
  • 2
    Have you considered using TBB? Then your code can be simplified to a `for`, with a simple `tbb::parallel_for` nested in it. If you don't want to, please create a [mcve] and add it to the question. We want a complete piece of code (with unnecessary details stripped), that we can compile, run, and observe the issue (clearly state the issue; if the code hangs, state on what line it hangs, and how you determined that it does so). – HolyBlackCat Jul 30 '23 at 16:06
  • Done @HolyBlackCat. - TBB, adds a new learning curve. Thank you for the suggestion. I have append the Minimal, and ran it to make sure it runs with copy paste. – Adrian E Jul 30 '23 at 17:26
  • 1
    Looking at your minimal example. `ProcessSamples` sets `ActiveThreads==4` and `SamplesReady==1`, then spins and waits for `ActiveThreads` to go to zero. `WaitingRoom` waits for `SamplesReady`, does something, then decrements `ActiveThreads`. Now, if one thread gets to `SamplesReady.store(0); ActiveThreads.fetch_sub(1);` before any other thread notices that `SamplesReady` is set, then `ActiveThreads` will never go to zero, and all threads will spin forever. – Igor Tandetnik Jul 30 '23 at 18:19
  • 5
    The wise people of Stackoverflow heard your appeal and thusly proclaim: the endless quests for lock-free unicorn fairies using atomic objects will end in tears more often than not. True multithreaded bliss will be found in plain, old-fashioned, mutexes and condition variables. – Sam Varshavchik Jul 30 '23 at 18:21
  • 2
    Yes, correctness first, with multi threaded code, always. It's hard enough to get right as it is, the last thing you want is Heisenbugs like this – Paul Sanders Jul 30 '23 at 18:24
  • 1
    It's not even clear what you are trying to achieve. Do you want each sample to be processed (passed to `PretendToProcessSamples`) exactly once, by any worker thread? Do you want it processed four times, by any combination of threads? Do you want it processed exactly once by each thread, four times total? Do you want all processing on the first sample to finish before starting on the second sample, or do you want them processed in parallel? – Igor Tandetnik Jul 30 '23 at 18:27
  • @IgorTandetnik - You are right on the Min, Typing it I didn't realize how small that "PretendToProcessSamples" is. I should have put some delays or something. Believe me - it would reach it first. BUT - THANK YOU - because I do need to check and see if there is any possible way, in the software it could reach it before hand. – Adrian E Jul 30 '23 at 18:28
  • @SamVarshavchik - Thank you for the laughter, it's been a rough week. BUT now that I had to do the MIN version, I can look for something I missed. – Adrian E Jul 30 '23 at 18:31
  • @IgorTandetnik - PretendToProcessSamples was a place holder. THE REQUEST, asked to put the bare min code... I just didn't realize that Pretend needed to take some realistic time. WHICH brings us back to before I updated my question. – Adrian E Jul 30 '23 at 18:34
  • 3
    I, for one, cannot debug code I cannot see, code described only in prose. Hence the request for [mcve]. Your verbal description likely makes perfect sense to you, but it doesn't make any sense to me. Plus, there's always a possibility that your actual code doesn't reflect your intention, as described in prose (in other words, that it contains a bug). If the minimal example doesn't accurately reflect the details of the real system, then you'd have to come up with one that does. – Igor Tandetnik Jul 30 '23 at 18:37
  • @IgorTandetnik - Point made good sir. BUT NOW, you sent me looking for other causes... 5 Days of this - BUT NOW, I have something else to dig into and see if its a problem. ITHANK YOU SIR – Adrian E Jul 30 '23 at 18:45
  • 4
    *"TBB, adds a new learning curve"* There's almost no curve, just a single function (that you need) that loops over things in parallel. On the contrary, atomics and the memory model have a horrible learning curve. – HolyBlackCat Jul 30 '23 at 18:55
  • 2
    If threads are meant to perform long-running processing, then lock-free algorithm makes even less sense. In your minimal example, if we assume `PretendToProcessSamples` takes a long time, then the main thread is spinning in a busy loop that whole time, turning one CPU core (that could have been doing useful work) into a space heater. It would be better for all involved if that thread were instead suspended blocking on a mutex or similar. – Igor Tandetnik Jul 30 '23 at 19:18
  • @IgorTandetnik - Found it elsewhere. BECAUSE you questioned something. SO I looked. elsewhere. REGARDING the Wait time... This will be altered once it is working to go get the next samples and while the threads do they work. I just needed to get over this hurdle. THANKS – Adrian E Jul 30 '23 at 19:38
  • @HolyBlackCat - TBB would add 1 extra log in the fire of using atomics. BUT SIR, you asking for the Min example. - started the path of figuring out what was going on. ALTHOUGH it locked up at the point of the code I was referring to, the cause , "Windows tasking out a thread", and causing a hiccup elsewhere. THANK YOU for pushing that task. – Adrian E Jul 30 '23 at 20:03

1 Answers1

-1

OKAY - THANK YOU Kind folks for HELPING. Asking for "minimal reproducible example" wouldn't work because the REAL version is so math intensive, it wouldn't take that short of time. AND YET @IgorTandetnik !!!!

@IgorTandetnik **** YOU caused me to be Suspicious. I DID modify the REAL code, so that no threads had to wait (no bottlenecks). So some will jump straight out, and wait for the next level. BUT that still wasn't the problem. EXCEPT...

Naughty windows tasking out of a thread, is just enough time to bring up THIS condition of ELSEWHERE !!! Considering this, I did a hard block.

ActiveThreads.fetch_sub(1);
while(Threads.load())
{
}
SamplesReady.store(0);

AND lo and behold I am a thread short. Continuing debugging I had to hit break points on a one of two loops that monitor completion of a task, and there it was. STUCK. 100 * n, number of times, a windows hiccuped (context switching) on a thread, and it would be dead in the water.

YOU ALL HELPED me shake the thing I have been looking at for 3 days. THANK YOU THANK YOU THANK YOU!!!! A lot of atomics, but it FAST when you remove mutex!!!

Clifford
  • 88,407
  • 13
  • 85
  • 165
Adrian E
  • 29
  • 6