I'm trying to increase the performance of some C++ code using OpenMP, but I'm not seeing very good scaling. Before delving into the details of my code, I have a very general question that I think could save a lot of time if I can get a definitive answer to it.
The basic structure of the code is a vector of objects (say size num_objs = 5000), where each object holds a relatively small vector of doubles (say size num_elems = 500). I want to loop over this vector of objects and, for each object, perform a subloop over its member vector to modify each element. I am only attempting to parallelize the outer loop (over the objects), as this is the standard approach with OpenMP and that loop is much larger than the nested one.
So now for my question. Am I taking a severe performance hit by looping over the array of objects and then looping over each of their smaller member vectors? Should I expect a significant increase in performance if I instead made one large vector of size num_objs * num_elems and then did a parallel loop over "chunks" of that big vector corresponding to the member vectors stored in each object described above? That way both the outer loop and the inner loop would access data from one big vector rather than having to fetch data from separate objects.
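To make the alternative concrete, the flattened layout I have in mind would look roughly like this (just a sketch; the names x, y, and offset are placeholders for illustration, not from my real code):

#include <omp.h>
#include <vector>

int main() {
    const int num_objs  = 5000;
    const int num_elems = 500;
    const double a = 5.0;

    // One contiguous buffer instead of num_objs separate member vectors.
    std::vector<double> x(num_objs * num_elems, 2.0);
    std::vector<double> y(num_objs * num_elems, 3.0);

    #pragma omp parallel for
    for (int obj = 0; obj < num_objs; obj++) {
        const int offset = obj * num_elems; // start of this object's "chunk"
        for (int i = 0; i < num_elems; i++) {
            x[offset + i] = a * y[offset + i] + x[offset + i];
        }
    }
    return 0;
}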
The actual code is much more complicated than the above description would suggest, so trying this alternative approach would require a lot of time spent modifying it. Therefore, I just wanted to get a feel for how significant a speedup I could get if I spent the time restructuring the entire code. I don't have a lot of knowledge about computer architecture, memory access, caches, etc., so apologies if this is painfully obvious.
EDIT: I was thinking there might be a simple answer to this; however, I see that's not really the case. Please consider the following (simplified) example.
#include <cmath>
#include <ctime>
#include <iostream>
#include <omp.h>
#include <string>
#include <vector>

class Block {
public:
    static double a;
    std::vector<double> x;
    std::vector<double> y;
    Block(int N);
};

double Block::a = 5;

int main(int argc, char const *argv[]) {
    int num_blocks = 80000;
    int num_elems = 1000;
    int num_iter = 100;
    int nthreads = 1;
    bool parallel_on = true;

    omp_set_num_threads(nthreads);

    // Build the vector of objects, each holding its own pair of member vectors.
    std::vector<Block> block_vec;
    for (int i = 0; i < num_blocks; i++) {
        block_vec.push_back(Block(num_elems));
    }

    double start;
    double end;
    start = omp_get_wtime();

    int iter = 0;
    while (iter < num_iter) {
        // Parallelize only the outer loop over blocks; the inner loop is a
        // SAXPY-style update of each block's member vectors.
        #pragma omp parallel for if (parallel_on)
        for (int bl = 0; bl < num_blocks; bl++) {
            for (int i = 0; i < num_elems; i++) {
                block_vec[bl].x[i] = Block::a * block_vec[bl].y[i] + block_vec[bl].x[i];
            }
        }
        iter++;
        std::cout << "ITER: " << iter << std::endl;
    }

    end = omp_get_wtime();
    double time_taken = end - start;
    std::cout << "TIME: " << time_taken << std::endl;

    return 0;
}

Block::Block(int N) {
    x.assign(N, 2.0);
    y.assign(N, 3.0);
}
I compile this program with:
g++ -fopenmp -O3 saxpy.cpp
I'm running it on an i7-6700 CPU @ 3.40GHz (four physical cores, eight logical cores). Here are the computation times for different thread counts:
1 THREAD: 8.65s
2 THREADS: 7.37s
3 THREADS: 7.41s
4 THREADS: 7.65s
I did try a version of this code, as described above, that makes use of one big vector rather than the nested structure; however, it gave about the same result, actually a little slower.