
The program I am implementing involves iterating over a medium amount of independent data, performing some computation, collecting the result, and then looping back over again. This loop needs to execute very quickly. To solve this, I am trying to implement the following thread pattern.

[Image: Thread Pattern]

Rather than spawning threads in setup and joining them in collect, I would like to spawn all threads up front and keep them synchronized throughout their loops. This question regarding thread barriers initially seemed to point me in the right direction, but my implementation is not working. Below is my example:

#include <barrier>
#include <thread>
#include <vector>

int main() {

    int counter = 0;
    int threadcount = 10;

    auto on_completion = [&]() noexcept {
        ++counter; // Increment counter
    };

    std::barrier sync_point(threadcount, on_completion);

    auto work = [&]() {
        while(true)
            sync_point.arrive_and_wait(); // Keep cycling the sync point
    };

    std::vector<std::thread> threads;

    for (int i = 0; i < threadcount; ++i) 
        threads.emplace_back(work); // Start every thread
    

    for (auto& thread : threads) 
        thread.join();
    
}

To keep things as simple as possible, there is no computation being done in the worker threads, and I have done away with the setup thread. I am simply cycling the threads, syncing them after each cycle, and keeping a count of how many times they have looped. However, this code deadlocks very quickly, and the more threads there are, the sooner it deadlocks. Adding work or a delay inside the compute threads postpones the deadlock but does not prevent it.

Am I abusing the thread barrier? Is this unexpected behavior? Is there a cleaner way to implement this pattern?

Edit

It looks like removing the on_completion function gets rid of the deadlock. I tried a different approach that meets the synchronization requirements without a completion function, but it still deadlocks fairly quickly.

int threadcount = 10;

std::barrier start_point(threadcount + 1);
std::barrier stop_point(threadcount + 1);

auto work = [&](int i) {
    while(true) {
        start_point.arrive_and_wait();
        stop_point.arrive_and_wait();
    }
};

std::vector<std::thread> threads;

for (int i = 0; i < threadcount; ++i) {
    threads.emplace_back(work, i);
}

while (true) {
    std::cout << "Setup" << std::endl;
    start_point.arrive_and_wait(); // Sync to start
    // Workers do work here
    stop_point.arrive_and_wait(); // Sync to end
    std::cout << "Collect" << std::endl;
}
DJP
  • What happens if you remove your `on_completion` function? – Nicol Bolas Sep 24 '22 at 03:34
  • @NicolBolas Removing it and moving the counter to the worker thread with no setup or collection does get rid of the deadlock. I tried a different implementation to sync them with setup and all (see edit) however this still leads to deadlock – DJP Sep 24 '22 at 03:59
  • Your second version now has a data race, since every thread is doing a read/modify/write on the same object with no synchronization. – Nicol Bolas Sep 24 '22 at 04:09
  • @NicolBolas Just corrected it by removing access to any data inside the worker threads, the behavior however remains the same. – DJP Sep 24 '22 at 05:39
  • Just a suggestion: have you tried to specialize the barrier: `std::barrier`? The difference is that _by using the default template argument the completion step is run as part of the call to arrive that caused the expected count to reach zero and for other specializations, the completion step is run on one of the threads that arrived at the barrier during the phase._ So in the first case, the thread that caused the counter to hit zero runs the completion function; otherwise, some randomly chosen thread does. If it's worth noting, I've now run it many times with no deadlock. – Erdal Küçük Sep 24 '22 at 06:30
  • As a hint (similar task as yours but C related and `pthread_barrier_wait`): https://stackoverflow.com/questions/69008816/how-to-properly-synchronize-threads-at-barriers/69062834#69062834 – Erdal Küçük Sep 24 '22 at 06:48
  • @ErdalKüçük Would you mind sharing the system specs that you ran on without deadlock? I tried running identical code on a different system and it is working there. I am starting to wonder if it is a bug. System that worked: Intel, macOS, clang. Did not work on: AMD, Windows 11 MinGW, GCC. – DJP Sep 24 '22 at 15:59
  • Linux 5.19.10-arch1-1 #1 SMP PREEMPT_DYNAMIC x86_64 Intel GNU/Linux; gcc (GCC) 12.2.0. I've realized that I've tested only in debug mode (-g); I'll test it with optimization on. – Erdal Küçük Sep 24 '22 at 17:03
  • I should also remark that I have a `std::printf` in the completion function (after `++counter`) and in the thread function (before `arrive_and_wait`). But even without any output, I do not get a deadlock. – Erdal Küçük Sep 24 '22 at 17:09
