A little bit of background - I'm running the following setup:

  • Intel Core i5-8300H (4 cores, 8 threads)
  • 32 GB RAM
  • Ubuntu 19.10
  • GCC 9.2.1, C++17 standard

I have a thread manager: essentially an object to which you hand some data and a callable, and which then runs the task in parallel. The thread manager can also time out threads (some tasks can hang, which does happen with the work I'm doing), hand them data in batches, and so on.

The pseudo-code for this behaviour is as follows:

function do_tasks(task, data, batch_size, timeout, threads, output_streams):
    assert arguments_are_valid()

    failed_tasks = []

    while(true):
        if data.size() == 0:
            break

        for thread in threads:
            if thread.running():
                stop_thread(thread)

            if thread.results.size() == 0:
                failed_tasks <- failed_tasks + thread.given_data
            else:
                data <- data + thread.given_data(thread.given_data.begin() + thread.results.size(), thread.given_data.end())

            start_thread(thread, task, take_data(data, min(batch_size, data.size())))

        wait_for_threads_completed_or_timeout(threads, timeout)

    return failed_tasks

I'm not using anything exotic; this is all accomplished with plain std::thread, std::list, std::future and std::promise.
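
To show how this maps onto those primitives, here is a minimal sketch of one way wait_for_threads_completed_or_timeout could be written (this is not the real code; the idea is that each worker fulfils a "done" promise as the last thing it does with its batch, and the manager waits on the matching futures against one shared deadline):

#include <chrono>
#include <future>
#include <vector>

// Hypothetical sketch: wait for every worker to finish its batch, but give
// the whole group at most `timeout` in total. Each worker is assumed to call
// set_value() on its "done" promise as the last thing it does with its batch.
void wait_for_threads_completed_or_timeout(
    std::vector<std::future<void>>& worker_done,
    std::chrono::milliseconds timeout)
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (auto& done : worker_done)
        done.wait_until(deadline); // returns immediately if already finished
}

Any worker whose future is still not ready after this returns is the one that gets stopped through its promise, as described further down.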

Long story short: you give each thread its data. When you evaluate what a thread has done, if the whole batch failed (i.e. none of the data elements were solved), the whole batch is transferred into a failed_tasks container, which is later returned. These failed batches are later retried with a batch_size of 1 (so when a task then times out, it really is something that has to be checked by hand), but that part is not important here. If at least one of the data elements is resolved, the unresolved part is transferred back into the data container. This runs until all data elements are either resolved or marked as failed tasks.
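
Since everything lives in std::lists, the per-batch bookkeeping can be done with splice instead of copying. A simplified sketch of the idea (not the actual code; the element type and names are made up):

#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Sketch: after a thread has been stopped, either record its whole batch as
// failed (nothing was resolved) or splice the unresolved tail back onto the
// work queue. splice() moves list nodes instead of copying elements.
void reclaim_batch(std::list<std::string>& data,
                   std::list<std::string>& failed_tasks,
                   std::list<std::string>& given_data,
                   std::size_t resolved_count)
{
    if (resolved_count == 0)
    {
        // Whole batch failed: move it into failed_tasks.
        failed_tasks.splice(failed_tasks.end(), given_data);
    }
    else
    {
        // Requeue the unresolved remainder, drop the resolved prefix.
        auto first_unresolved =
            std::next(given_data.begin(),
                      static_cast<std::ptrdiff_t>(resolved_count));
        data.splice(data.end(), given_data, first_unresolved,
                    given_data.end());
        given_data.clear();
    }
}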

Now, let's say I run this on 100000 elements using 7 threads. The first time I run it, up to 2000 elements time out; the second run is similar, with 500-2000 elements timing out. But here's the weird part: after running it a few times, I get the intended behaviour and only around 2-5 tasks fail.

Looking at the function being run, it can process 10500 data elements per second on average when single-threaded. Its minimum running time is less than a nanosecond, while its maximum observed running time is a few milliseconds (it matches data against regular expressions, and there are input sequences which act more or less like a DoS attack and can slow down execution considerably). Running it on 7 threads usually enables processing of around 70000 data elements per second on average, so the efficiency is about 95%. However, during the first few runs this drops to as low as 55000 data elements per second, which is around 75% efficiency, a considerable drop in performance. Performance itself is not that critical (I need to process 20000 data elements per second, which 2 threads are enough for), but the lower performance comes with a higher number of failed tasks, which leads me to suspect that the problem is in the threads themselves.

I have read this:

What really is to “warm up” threads on multithreading processing?

but there the behaviour seems to be caused by the JIT compiler, something C++ doesn't have since it's compiled ahead of time. I know about std::thread creation overhead, but I suspect it's nowhere near this big. What I'm experiencing here resembles a warm-up, yet I have never heard of threads having a warm-up period. The behaviour is consistent even when I change the data (a different data set every run), so I suspect there is no caching going on that would speed things up.
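
For reference, the raw launch-plus-join cost of std::thread can be sanity-checked with something along these lines; on a typical desktop Linux box it comes out at roughly tens of microseconds per thread, nowhere near enough to explain thousands of timed-out elements:

#include <chrono>
#include <cstdio>
#include <thread>

// Rough benchmark of std::thread creation + join overhead.
int main()
{
    constexpr int runs = 1000;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
        std::thread([] {}).join();
    const auto elapsed = std::chrono::steady_clock::now() - start;
    const long long total_us =
        std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    std::printf("average launch + join: %lld us\n", total_us / runs);
}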

The implementation is probably correct: it has been reviewed and formally tested. The code is mostly C and C++ and is actively maintained, so I suspect this isn't a bug. But I could not find anyone else on the internet with the same problem, which left me wondering whether there's something we're missing.

Anyone have an idea why this warm-up happens?

EDIT: The work is executed like this:

for(ull i = 0; i != batch_size && future.wait_for(nanoseconds(0)) == future_status::timeout; ++i)
{
    //do stuff with the i-th data element; the future check stops the loop
    //between elements once the manager has set the corresponding promise
}

The function that is run by the thread receives a future it can check before running the task on the next data element; in the snippet above it's called future.
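
Here is a stripped-down, self-contained version of that mechanism (not the real code; the sleeps merely stand in for the task and for the timed wait): the worker polls the future between elements, and the manager stops it by setting the promise and joining.

#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

using namespace std::chrono;
using ull = unsigned long long;

int main()
{
    std::promise<void> stop_signal;
    std::future<void> stop = stop_signal.get_future();

    constexpr ull batch_size = 1000000;
    ull processed = 0;

    std::thread worker([&] {
        // Same shape as the loop above: stop between elements once the
        // promise has been set, otherwise keep going until the batch ends.
        for (ull i = 0; i != batch_size &&
             stop.wait_for(nanoseconds(0)) == std::future_status::timeout; ++i)
        {
            std::this_thread::sleep_for(microseconds(10)); // stand-in for the task
            ++processed;
        }
    });

    // Manager side: wait for a while (stand-in for the timed wait), then
    // request a stop; the worker finishes its current element and exits.
    std::this_thread::sleep_for(milliseconds(50));
    stop_signal.set_value();
    worker.join();

    std::printf("processed %llu of %llu elements before the stop\n",
                processed, batch_size);
}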

  • I think you know what we're going to say...! – Lightness Races in Orbit Nov 06 '19 at 11:29
  • How do you timeout threads? i.e. what do you do after the time limit to stop the progress of a thread that is taking too long? – Richard Critten Nov 06 '19 at 11:30
  • This isn't very useful, but it is the basic reason why jitters and garbage collectors became a practical compromise on recent hardware. They only have to be as fast as the cost of a hard paging fault. Gobs of time. You rarely notice, until you start measuring. – Hans Passant Nov 06 '19 at 11:53
  • @RichardCritten After the time limit is reached, or once all threads complete their work, the wait_for_threads_completed_or_timeout function returns and the while loop resumes. The loop then stops a thread by setting a promise and joining it. The task function checks whether the future status is ready on each iteration, so once the promise is set the thread finishes its current task but won't start a new one. I have edited the question to show how the threads do their work, since it's not practical to put the full code example here. – Ljac Nov 06 '19 at 12:23
  • _Do C++ std::threads have a warm-up period?_ No. Apart from that, without the source code and the data no answer can be given. – Maxim Egorushkin Nov 06 '19 at 13:20
  • @MaximEgorushkin So what you're saying is that this issue, if reproduced on a different machine, is likely to be a software engineering problem on our side? Sadly I can't post the code, as it's a few hundred lines long and I'm under an NDA. – Ljac Nov 06 '19 at 16:09
