I would like to parallel a big loop using OpenMP to improve its efficiency. Here is the main part of the toy code:
vector<int> config;
config.resize(indices.size());
omp_set_num_threads(2);
#pragma omp parallel for schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
#pragma omp simd
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
#pragma omp atomic
result[index]++;
}
Then I found I cannot get improvements in efficiency if I use 2, 4, or 8 threads. The execution time of the parallel versions is generally greater than that of the sequential version. The outer loop has 10000 iterations and they are independent so I want multiple threads to execute those iterations in parallel.
I guess the reasons for performance decrease maybe include: private copies of config
? or, random access of ref_table
? or, expensive atomic operation? So what are the exact reasons for the performance decrease? More importantly, how can I get a shorter execution time?