I'm trying to increase the performance of some C++ code using OpenMP, but I'm not seeing very good scaling. Before delving into the details of my code, I have a very general question that I think could save a lot of time if I can get a definitive answer to it.
The basic structure of the code is a vector of objects (say size num_objs = 5000), where each object holds a relatively small vector of doubles (say size num_elems = 500). I want to loop over this vector of objects and, for each object, perform a subloop over its member vector to modify each element. I am only attempting to parallelize the outer loop (over the objects), as this is the standard approach with OpenMP and that loop is much larger than the nested one.
So now for my question. Am I taking a severe performance hit by looping over the array of objects and then looping over each of their smaller member vectors? Should I expect a significant increase in performance if I instead made one large vector of size num_objs * num_elems and then did a parallel loop over "chunks" of that big vector corresponding to the member vectors stored in each object described above? That way both the outer loop and the inner loop would access data from one big vector rather than having to fetch data from separate objects.
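To make the alternative concrete, the flattened layout I have in mind would look roughly like this (just a sketch; the names x, y, and offset are placeholders for illustration, not from my real code):

#include <omp.h>
#include <vector>

int main() {
    const int num_objs  = 5000;
    const int num_elems = 500;
    const double a = 5.0;

    // One contiguous buffer instead of num_objs separate member vectors.
    std::vector<double> x(num_objs * num_elems, 2.0);
    std::vector<double> y(num_objs * num_elems, 3.0);

    #pragma omp parallel for
    for (int obj = 0; obj < num_objs; obj++) {
        const int offset = obj * num_elems; // start of this object's "chunk"
        for (int i = 0; i < num_elems; i++) {
            x[offset + i] = a * y[offset + i] + x[offset + i];
        }
    }
    return 0;
}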
The actual code is much more complicated than the above description would suggest, so trying this alternative approach would require a lot of time spent modifying it. Therefore, I just wanted to get a feel for how significant a speedup I could get if I spent the time restructuring the entire code. I don't have a lot of knowledge about computer architecture, memory access, caches, etc., so apologies if this is painfully obvious.
EDIT: I was thinking there might be a simple answer to this; however, I see that's not really the case. Please consider the following (simplified) example.
#include <cmath>
#include <ctime>
#include <iostream>
#include <omp.h>
#include <string>
#include <vector>

class Block {
public:
    static double a;
    std::vector<double> x;
    std::vector<double> y;
    Block(int N);
};

double Block::a = 5;

int main(int argc, char const *argv[]) {
    int num_blocks = 80000;
    int num_elems = 1000;
    int num_iter = 100;
    int nthreads = 1;
    bool parallel_on = true;

    omp_set_num_threads(nthreads);

    // Build the vector of objects, each holding its own pair of member vectors.
    std::vector<Block> block_vec;
    for (int i = 0; i < num_blocks; i++) {
        block_vec.push_back(Block(num_elems));
    }

    double start;
    double end;
    start = omp_get_wtime();

    int iter = 0;
    while (iter < num_iter) {
        // Parallelize only the outer loop over blocks; the inner loop is a
        // SAXPY-style update of each block's member vectors.
        #pragma omp parallel for if (parallel_on)
        for (int bl = 0; bl < num_blocks; bl++) {
            for (int i = 0; i < num_elems; i++) {
                block_vec[bl].x[i] = Block::a * block_vec[bl].y[i] + block_vec[bl].x[i];
            }
        }
        iter++;
        std::cout << "ITER: " << iter << std::endl;
    }

    end = omp_get_wtime();
    double time_taken = end - start;
    std::cout << "TIME: " << time_taken << std::endl;

    return 0;
}

Block::Block(int N) {
    x.assign(N, 2.0);
    y.assign(N, 3.0);
}
I compile this program with:
g++ -fopenmp -O3 saxpy.cpp
I'm running it on an i7-6700 CPU @ 3.40GHz (four physical cores, eight logical cores). Here are the computation times for different thread counts:
1 THREAD: 8.65s
2 THREADS: 7.37s
3 THREADS: 7.41s
4 THREADS: 7.65s
I did try a version of this code, as described above, that makes use of one big vector rather than the nested structure; however, it gave about the same result, actually a little slower.