
Assume that each loop iteration of my code takes the same time.
Please note that each loop iteration involves memory access from disjoint portions of a large contiguous memory.
I am using the VS2019 compiler.

I thought it should not matter whether I use

#pragma omp for schedule(static, CHUNKSIZE)

OR

#pragma omp for schedule(static)

I have used values like 5 for CHUNKSIZE. I am asking because I see that the first variation performs slightly better.
Can someone shed some light on this?


1 Answer


If you do not specify a chunk

#pragma omp for schedule(static)

OpenMP will:

Divide the loop into equal-sized chunks or as equal as possible in the case where the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, chunk size is loop_count/number_of_threads

Hence, take CHUNKSIZE=5, 2 threads, and a loop (to be parallelized) with 22 iterations. With this schedule, thread ID=0 is assigned the iterations {0 to 10} and thread ID=1 the iterations {11 to 21}, i.e., 11 iterations per thread. However, for:

#pragma omp for schedule(static, CHUNKSIZE)

thread ID=0 is assigned the iterations {0 to 4}, {10 to 14} and {20 to 21}, whereas thread ID=1 works on the iterations {5 to 9} and {15 to 19}. Therefore, the first and second threads are assigned 12 and 10 iterations, respectively.
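If you want to verify this mapping on your own machine, here is a minimal sketch of my own (the array name owner, the omp_set_num_threads(2) call, and N = 22 are my choices, mirroring the example above, not taken from your code). It records which thread executes each iteration; change the schedule clause between the two variants to see the two assignments:

#include <omp.h>
#include <stdio.h>

#define N 22

int main(void)
{
    int owner[N];

    omp_set_num_threads(2);   /* 2 threads, as in the example above */

    /* Change this to schedule(static, 5) to see the round-robin chunk assignment. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        owner[i] = omp_get_thread_num();

    for (int i = 0; i < N; i++)
        printf("iteration %2d -> thread %d\n", i, owner[i]);

    return 0;
}

With VS2019 you can compile this with the /openmp flag enabled.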

All this to show that having

#pragma omp for schedule(static)

and

#pragma omp for schedule(static, CHUNKSIZE)

is not the same. Different chunk sizes can directly affect load balancing, cache misses, and so on, even if one can:

Assume that each loop iteration of my code takes the same time

Naturally, things get more complicated if each iteration of the loop being parallelized performs a different amount of work. For instance:

for(int i = 0; i < 22; i++)
  for(int j = i+1; j < 22; j++)
     ; // do the same amount of work for every (i, j) pair

With

#pragma omp for schedule(static)

thread ID=0 would execute 176 inner iterations and thread ID=1 only 55, a load imbalance of 176 - 55 = 121 iterations,

whereas with

#pragma omp for schedule(static, CHUNKSIZE)

thread ID=0 would execute 141 inner iterations and thread ID=1 would execute 90, a load imbalance of 141 - 90 = 51 iterations.

As you can see, in this case without a chunk size one thread performed 121 more iterations than the other, whereas with a chunk of 5 the difference was reduced to 51.
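Rather than computing the imbalance by hand, you can also measure it. The following is a small sketch of my own, under the same assumptions as above (22 iterations, 2 threads, a counter named work that I introduce for illustration); it counts the inner iterations each thread ends up with, and swapping the schedule clause should reproduce the 176/55 and 141/90 splits:

#include <omp.h>
#include <stdio.h>

#define N 22

int main(void)
{
    long long work[2] = {0, 0};   /* inner iterations done by each thread */

    omp_set_num_threads(2);

    /* Swap schedule(static) for schedule(static, 5) and compare the output. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        long long count = 0;
        for (int j = i + 1; j < N; j++)
            count++;              /* stand-in for the real per-iteration work */
        work[omp_get_thread_num()] += count;   /* each thread updates only its own slot */
    }

    printf("thread 0: %lld inner iterations\n", work[0]);
    printf("thread 1: %lld inner iterations\n", work[1]);
    return 0;
}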

To conclude, it depends on your code, the hardware on which that code runs, how you perform the benchmark, how big the time difference is, and so on. The bottom line is: you need to analyze it, look for potential load-balancing problems, measure cache misses, and so on. Profiling is always the answer.

dreamcrash