
I would like to parallelize a big loop using OpenMP to improve its efficiency. Here is the main part of the toy code:

vector<int> config;
config.resize(indices.size());

omp_set_num_threads(2);
#pragma omp parallel for schedule(static, 5000) firstprivate(config)
    for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallelize
#pragma omp simd
        for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
            config[j] = ref_table[i][indices[j]];
        }
        int index = GetIndex(config); // do simple computations on the picked values to get the index
#pragma omp atomic
        result[index]++;
    }

However, I found I cannot get any improvement in efficiency with 2, 4, or 8 threads: the execution time of the parallel versions is generally greater than that of the sequential version. The outer loop has 10000 iterations, and they are independent, so I want multiple threads to execute those iterations in parallel.

I guess the reasons for the performance decrease may include: the private copies of config? Or the random access of ref_table? Or the expensive atomic operation? So what are the exact reasons for the performance decrease? More importantly, how can I get a shorter execution time?

Joxixi
  • What does `GetIndex` do? Any explicit or hidden memory allocations (like using a vector or list)? – 1201ProgramAlarm Jul 24 '21 at 16:33
  • How big is indices.size()? – Laci Jul 24 '21 at 16:36
  • 1
    `"or, random access of ref_table?"` Multiple threads accessing the same memory location is not a problem, as long as these accesses are strictly read-only. It will only become a problem if at least one of the threads performs a write operation on that location. – Andreas Wenzel Jul 24 '21 at 16:40
  • 1
    I think, as usual, your code is memory bound. It means that the speed of your program mainly depends on speed of memory read/write. Please read this answer, I think it may apply to your case as well: https://stackoverflow.com/questions/68503586/using-openmp-to-parallelize-a-loop-over-a-vector-c-objects/68507395#68507395 – Laci Jul 24 '21 at 16:41
  • @1201ProgramAlarm It does not contain memory allocations. It contains some simple additions and multiplications. – Joxixi Jul 25 '21 at 00:50
  • @Laci The size of indices is less than 10 and no less than 2. – Joxixi Jul 25 '21 at 00:51
  • 1
    I don't think the `#pragma omp simd` should give any speedup even if `indices` were bigger because you are doing a gather operation in the inner loop which I can not imagine to profit from vectorization. – paleonix Jul 25 '21 at 10:13

1 Answer


Private copies of config and random access of ref_table are not problematic. I think the workload is very small, so there are 2 potential issues which prevent efficient parallelization:

  1. The atomic operation is too expensive.
  2. The overheads are bigger than the workload (it simply means that it is not worth parallelizing with OpenMP).

I do not know which one is more significant in your case, so it is worth trying to get rid of the atomic operation. There are 2 cases:

a) If the result array is zero-initialized, you have to use:

#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)

where N is the size of the result array, and delete #pragma omp atomic. Note that array-section reductions require OpenMP 4.5 or later. It is also worth removing #pragma omp simd for a loop of only 2-10 iterations. So, your code should look like this:

#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
    for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallelize
        for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
            config[j] = ref_table[i][indices[j]];
        }
        int index = GetIndex(config); // do simple computations on the picked values to get the index
        result[index]++;
    }

b) If the result array is not zero-initialized, the solution is very similar: use a zero-initialized temporary array in the reduction, and after the loop add it to the result array.

If the speed does not increase, then your code is not worth parallelizing with OpenMP on your hardware.

Laci