
I need to measure the execution time of some code inside a for loop. Originally I needed to measure several different activities, so I wrote a timer class to help me with that. After that I tried to speed things up by parallelizing the for loop using OpenMP. The problem is that when running my code in parallel my time measurements become really different: the values increase by up to a factor of roughly 10. To rule out a flaw in the timer class I started measuring the execution time of a whole loop iteration instead, so structurally my code looks something like this:

#pragma omp parallel for num_threads(20)
for(size_t j = 0; j < entries.size(); ++j)
{
    auto t1 = std::chrono::steady_clock::now();
    // do stuff
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "Execution time is " 
              << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count()
              << std::endl;
}

Here are some examples of the difference between measurements in parallel and in a single thread:

Single-threaded (s)    Multi-threaded (s)
11.363545868021        94.154685442
4.8963048650184        16.618173163
4.939025568            25.4751074
18.447368772           110.709813843

Even though these are only a few examples, this behaviour seems to hold for all loop iterations. I also tried using Boost's chrono library and its thread_clock, but got the same result. Am I misunderstanding something? What could be the cause of this? Could I be getting the cumulative time of all threads?

Inside the for loop, each iteration reads a different file. Based on this file I create and solve a multitude of mixed-integer optimisation models. I solve them with a MIP solver, which I set to run in a single thread. The solver instance is created anew on each iteration. The only variables shared between iterations are constant strings representing paths to some directories.

My machine has 32 threads (16 cores, 2 threads per core).

Also, here are the timings for the whole application in single-threaded mode:

real    23m17.763s
user    21m46.284s
sys     1m28.187s

and in multi-threaded mode:

real    12m47.657s
user    156m20.479s
sys     2m34.311s
Aisec Nory
  • It is laughably easy to write code that is slower when multithreaded, often easier than writing code that is faster. You will need to provide more information about the algorithm being parallelized and your implementation. Consider providing a [minimal reproducible example]. – user4581301 Jan 16 '23 at 16:11
  • *my code looks something like this* Given that, all you can get is *something like an answer*. For a good answer post your actual code and use `omp_get_wtime` to time your OpenMP programs. – High Performance Mark Jan 16 '23 at 16:13
  • Does your computer actually have 20 cores? – Pepijn Kramer Jan 16 '23 at 16:24
  • @user4581301 Actually, overall, the application runs faster in the multi-threaded case (which can be seen at the end of the question). Despite that, the individual timing of each loop iteration is reported as slower by the std::chrono library – Aisec Nory Jan 16 '23 at 16:30
  • @PepijnKramer He has 16 cores with 2 hyperthreads. OP: don't use more than 16 threads. What does OpenMP report as `omp_get_num_procs`? Also, don't do I/O inside a parallel region. Allocate an array of timings, each thread writes its element, then print out after the parallel region. – Victor Eijkhout Jan 16 '23 at 16:30
  • *during each iteration I read a different file* : concurrent I/Os on the same drive can be bad, particularly if the drive is a HDD. – PierU Jan 16 '23 at 16:32
  • @HighPerformanceMark I tried to use `omp_get_wtime`. It produces the same result as `std::chrono::steady_clock`. Also I've edited the question to give some explanation what is happening inside the for loop, but I doubt it'll be helpful. – Aisec Nory Jan 16 '23 at 16:33
  • There is now some contradiction between *when running my code in parallel my time measurements become really different - the values increase approximately up to a factor of 10* and *Actually, overall, the application runs faster in multi-threaded case* – PierU Jan 16 '23 at 16:38
  • @PepijnKramer `omp_get_num_procs` reports 32. Thanks for the advice about I/O; I'll try to remove it from the parallel region to see if it helps. – Aisec Nory Jan 16 '23 at 16:39
  • @PierU So if you time the overall execution with Linux's `time` command, the multi-threaded version completes faster (12m 47.657s) as opposed to the single-threaded version (23m 17.763s). But in the console output the majority (I didn't check the output from every iteration) of the for-loop iterations of the multi-threaded version report taking much longer than the corresponding iterations of the single-threaded version. – Aisec Nory Jan 16 '23 at 16:52
  • @AisecNory In a code that scales perfectly, the wall time of an individual iteration is the same between the serial and the parallel execution. But if the code doesn't scale perfectly the wall time of an individual iteration increases (almost by definition, actually). So you have a code that doesn't scale very well here. You should time your code (overall time) for different numbers of threads: 1, 2, 4, 8, 16... It's not rare that too many threads result in a slower execution time. – PierU Jan 16 '23 at 17:01
  • Why do you measure time inside the parallel loop? You may intend to measure the runtime of the whole loop... – Laci Jan 16 '23 at 17:49
  • @Laci I need to measure each loop iteration on its own, because I need to gather a distinct time for each of my input files for further statistical analysis. – Aisec Nory Jan 17 '23 at 00:34
  • @AisecNory Then you need to measure the CPU time, not the wall time. – PierU Jan 17 '23 at 09:24

1 Answer


A few points here.

What you're measuring corresponds (roughly) to what `time` returns as the user time, that is, the total CPU time consumed by all threads. But when we look at the real time reported by `time`, we see that your multithreaded code is running close to twice as fast as the single-threaded code. So it is scaling to some degree, but not very well.
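If the goal is a per-file number that stays comparable between the serial and parallel runs, one option is to record the CPU time consumed by the current thread for each iteration instead of (or alongside) the wall time, and to store the results in a pre-allocated array rather than printing from inside the parallel region (as suggested in the comments). A minimal sketch, assuming a POSIX system where clock_gettime with CLOCK_THREAD_CPUTIME_ID is available; time_iterations and n are illustrative names, not code from the question:

#include <cstddef>
#include <cstdio>
#include <ctime>
#include <vector>

// CPU time consumed so far by the calling thread, in seconds (POSIX).
static double thread_cpu_seconds()
{
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

void time_iterations(std::size_t n)   // n plays the role of entries.size()
{
    std::vector<double> cpu_times(n); // one slot per iteration, no sharing

    #pragma omp parallel for num_threads(20)
    for (std::size_t j = 0; j < n; ++j)
    {
        const double c1 = thread_cpu_seconds();
        // do stuff
        const double c2 = thread_cpu_seconds();
        cpu_times[j] = c2 - c1;       // time spent waiting is not counted
    }

    for (std::size_t j = 0; j < n; ++j)
        std::printf("Iteration %zu took %f s of CPU time\n", j, cpu_times[j]);
}

If the per-iteration CPU times are close between the serial and parallel runs while the wall times are not, the extra time is being spent waiting (on the disk, on memory bandwidth, or on oversubscribed cores) rather than computing.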

Reading a file in the parallel region may well be part of this. Even at best, the fastest NVMe SSDs can only support reading from a few (e.g., around three or four) threads concurrently before you're using the drive's entire available bandwidth (and if you're doing I/O efficiently, that may well be closer to two). If you're using an actual spinning hard drive, it's usually pretty easy for a single thread to saturate the drive's bandwidth. A PCIe 5 SSD should keep up with more threads, but I rather doubt even it has the bandwidth to feed 20 threads.
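One quick way to test that theory is to serialize just the file reads while leaving the rest of the work parallel. A minimal sketch, assuming entries holds the file paths and using a hypothetical load_file helper that reads one whole file into memory; the real reading code from the question would go in its place:

#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read one whole input file into a string.
static std::string load_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

void process_entries(const std::vector<std::string>& entries)
{
    #pragma omp parallel for num_threads(16)
    for (std::size_t j = 0; j < entries.size(); ++j)
    {
        std::string data;

        // Only one thread touches the disk at a time; the (presumably much
        // longer) model building and solving below still runs in parallel.
        #pragma omp critical(file_io)
        {
            data = load_file(entries[j]);
        }

        // build and solve the MIP models from `data` here
    }
}

If the per-iteration timings fall back toward the single-threaded numbers once the reads are serialized, the drive is a large part of the problem; reading all of the files up front, or limiting the number of threads that do I/O, would be the usual next steps.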

Depending on what parts of the standard library you're using, it's pretty easy to have some "invisible" shared variables. For one common example, code that uses Monte Carlo methods will frequently have calls to rand(). Even though it looks like a normal function call, rand() will typically end up using a seed variable that's shared between threads, and every call to rand() not only reads but also writes that shared variable, so the calls to rand() all end up being serialized.
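If something like rand() does turn out to be in the hot path, giving each thread its own generator removes the hidden shared state. A minimal sketch using the standard <random> facilities; this is a generic illustration, not code from the question:

#include <cstddef>
#include <cstdint>
#include <random>

#include <omp.h>

// Estimate pi by sampling points; each thread owns its generator, so no
// shared seed is ever read or written.
double monte_carlo_pi(std::size_t samples)
{
    std::size_t hits = 0;

    #pragma omp parallel reduction(+ : hits)
    {
        // Seed per thread; mixing in the thread number keeps the streams distinct.
        std::mt19937_64 gen(std::random_device{}() ^
                            static_cast<std::uint64_t>(omp_get_thread_num()));
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        #pragma omp for
        for (std::size_t i = 0; i < samples; ++i)
        {
            const double x = dist(gen);
            const double y = dist(gen);
            if (x * x + y * y <= 1.0)
                ++hits;
        }
    }

    return 4.0 * static_cast<double>(hits) / static_cast<double>(samples);
}

Declaring the generator thread_local, or more generally making any such per-call state private to each thread, accomplishes the same thing.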

You mention your MIP solver running in a single thread, but also that a separate instance is created for each iteration, which leaves it unclear whether the MIP-solving code is really one thread shared between the 20 other threads, or whether you have one MIP solver instance running in each of the 20 threads. I'd guess the latter, but if it's really the former, then its being a bottleneck wouldn't be surprising at all.

Without code to look at, it's impossible to get really specific though.

Jerry Coffin
  • You said "What you're measuring corresponds (roughly) to what `time` returns as the user time, that is, the total CPU time consumed by all threads". Do you mean that std::chrono's steady_clock measures time consumed by all threads? – Aisec Nory Jan 16 '23 at 17:03
  • @AisecNory: well, I suppose it depends a bit on what you do with the result. Based on your numbers, I guessed you were adding together the numbers you got from the individual threads, but rereading your question, it's not clear whether that's accurate or not. – Jerry Coffin Jan 16 '23 at 18:49