If you do not specify a chunk size in

#pragma omp for schedule(static)

OpenMP will:

Divide the loop into equal-sized chunks or as equal as possible in the case where the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, the chunk size is loop_count/number_of_threads.
Hence, take CHUNKSIZE=5, 2 threads, and a loop (to be parallelized) with 22 iterations. With the default chunk size, thread ID=0 will be assigned the iterations {0 to 10} and thread ID=1 the iterations {11 to 21}, i.e., 11 iterations per thread. However, for:
#pragma omp for schedule(static, CHUNKSIZE)
thread ID=0 will be assigned the iterations {0 to 4}, {10 to 14} and {20 to 21}, whereas thread ID=1 will work on the iterations {5 to 9} and {15 to 19}. Therefore, the first and second threads are assigned 12 and 10 iterations, respectively.
All this to show that having

#pragma omp for schedule(static)

and

#pragma omp for schedule(static, CHUNKSIZE)

is not the same. Different chunk sizes can directly affect the load balancing, cache misses, and so on, even if one can:

Assume that each loop iteration of my code takes the same time

Naturally, things get more complicated if each iteration of the loop being parallelized performs a different amount of work. For instance:
for(int i = 0; i < 22; i++)
    for(int j = i+1; j < 22; j++)
        // do the same work.
With

#pragma omp for schedule(static)

thread ID=0 would execute 176 iterations whereas thread ID=1 would execute 55, a load imbalance of 176 - 55 = 121 iterations, whereas with

#pragma omp for schedule(static, CHUNKSIZE)

thread ID=0 would execute 141 iterations and thread ID=1 90, a load imbalance of 141 - 90 = 51 iterations.
As you can see, in this case without the chunk one thread performed 121 more parallel tasks than the other, whereas with a chunk of 5 the difference was reduced to 51.
To conclude, it depends on your code, the hardware where that code is executed, how you are performing the benchmark, how big the time difference is, and so on. The bottom line is: you need to analyze it, look for potential load balancing problems, measure cache misses, and so on. Profiling is always the answer.