A little bit of background - I'm running the following setup:
- i5 8300H (4 core, 8 threads)
- 32 GB RAM
- Ubuntu 19.10
- GCC 9.2.1, C++17 standard
I have a thread manager: essentially an object you can relay some data to, along with a callable object, and it then runs the task in parallel. The thread manager can time out threads (in case some task hangs, which can happen in my workload), feed them data in batches, etc.
The pseudo-code for this behaviour is as follows:
function do_tasks(task, data, batch_size, timeout, threads, output_streams):
    assert arguments_are_valid()
    failed_tasks = []
    while(true):
        if data.size() == 0:
            break
        for thread in threads:
            if thread.running():
                stop_thread(thread)
            if thread.results.size() == 0:
                failed_tasks <- failed_tasks + thread.given_data
            else:
                data <- data + thread.given_data(thread.given_data.begin() + thread.results.size(), thread.given_data.end())
            start_thread(thread, task, take_data(data, min(batch_size, data.size())))
        wait_for_threads_completed_or_timeout(threads, timeout)
    return failed_tasks
I'm not using anything exotic; this is all accomplished with plain std::thread, std::list, std::future and std::promise.
Long story short, you give the thread its data. When the manager evaluates what the thread has done: if the whole batch failed (i.e. none of the data elements were solved), the whole batch is moved into a failed_tasks container, which is later returned. These failed batches are later re-run with a batch_size of 1 (so when a task times out then, it really is something that has to be checked by hand), but that part is not important here. If at least one of the data elements is resolved, the unresolved remainder is transferred back into the data container. This runs until every data element is either resolved or marked as a failed task.
Now, let's say I run this on 100000 elements with 7 threads. The first time I run it, up to 2000 elements time out. The second time something similar happens: 500-2000 elements time out. But here's the weird part: after running it a few times, I get the intended behaviour, with only around 2-5 tasks failing.
Looking at the function that is being run: single-threaded, it can process 10500 data elements per second on average. Its minimum observed running time is less than a nanosecond, while its maximum observed running time is a few milliseconds (it matches data against regular expressions, and there are sequences that act more or less as DoS attacks and can therefore slow down execution considerably). Running it on 7 threads usually enables the processing of 70000 data elements per second on average, so the efficiency is around 95%. During the first few runs, however, this drops to as low as 55000 data elements per second, which is around 75% efficiency, a considerable drop in performance. Now, performance is not that critical (I need to process 20000 data elements per second, for which 2 threads are enough), but the lower performance comes with a higher number of failed tasks, which leads me to suspect that the problem is in the threads themselves.
I have read this:
What really is to “warm up” threads on multithreading processing?
but there the behaviour is attributed to the JIT compiler, something C++ doesn't have since it's compiled ahead of time. I know std::thread has creation overhead, but I suspect it's not this big. What I'm experiencing here is similar to a warm-up, yet I have never heard of threads having a warm-up period. The behaviour is consistent even when I change the data (a different data set every run), so I suspect there is no caching going on that could speed things up.
The implementation is probably correct: it has been reviewed and formally tested. The code is mostly C and C++ and is actively maintained, so I doubt this is a bug. But I could not find anyone else on the internet with the same problem, which left me wondering whether there's anything we're missing.
Anyone have an idea why this warm-up happens?
EDIT: The work is executed like this:
for(ull i = 0; i != batch_size && future.wait_for(nanoseconds(0)) == future_status::timeout; ++i)
{
//do stuff
}
The function that is run by the thread receives a future (here called future) that it checks before running the task on the next data element.