The default parallel loop schedule in virtually all existing OpenMP compilers is `static`, which means that the OpenMP runtime splits the iteration space into roughly equal chunks and assigns them to the threads up front. Since you have 50 iterations and 20 threads, the work cannot be split equally as 20 does not divide 50. Therefore, half of the threads will do three iterations each while the other half will do two.
There is an implicit barrier at the end of the (combined) `parallel for` construct where the threads that finish earlier wait for the rest of the threads to complete. Depending on the OpenMP implementation, the barrier might be implemented as a busy-wait loop, as a wait operation on some OS synchronisation object, or as a combination of both. In the latter two cases, the CPU usage of a thread that hits the barrier will either drop to zero immediately as it goes into interruptible sleep, or will initially remain at 100% for a short time (the busy loop) and then drop to zero (the wait).
If every loop iteration takes exactly the same amount of time, the CPU usage will be 2000% initially, then after two iterations (and a bit more if the barrier implementation uses a short busy loop) it will drop to 1000%. If the iterations take varying amounts of time, the threads will arrive at the barrier at different moments and the CPU usage will decrease gradually.
In any case, use `schedule(dynamic)` to have each iteration handed to the first thread that becomes available. This will improve the CPU utilisation when the iterations take varying amounts of time. It will not help when each iteration takes the same amount of time; in that latter case, the solution is to make the number of iterations an integer multiple of the number of threads.