5

I have the following C++ program, which uses no communication; the same, identical work is done on all cores. I know that this doesn't use parallel processing at all:

#include <vector>   // std::vector
#include <mpi.h>    // MPI_Wtime

unsigned n = 130000000;
std::vector<double> vec1(n, 1.0);
std::vector<double> vec2(n, 1.0);
double t1, t2, dt;
t1 = MPI_Wtime();
for (unsigned i = 0; i < n; i++) 
{
  // Do something so it's not a trivial loop
  vec1[i] = vec2[i]+i;
}
t2 = MPI_Wtime();
dt = t2-t1;

I'm running this program on a single node with two Intel® Xeon® E5-2690 v3 processors, so I have 24 cores altogether. This is a dedicated node; no one else is using it. Since there is no communication, and each processor is doing the same amount of (identical) work, running it on multiple processors should take the same time. However, I get the following times (averaged over all cores):

1 core: 0.237

2 cores: 0.240

4 cores: 0.241

8 cores: 0.261

16 cores: 0.454

What could cause the increase in time, particularly for 16 cores? I have run callgrind and I get roughly the same number of data/instruction misses on all cores (the percentage of misses is the same).

I have repeated the same test on a node with two Intel® Xeon® E5-2628L v2 processors (16 cores altogether), and I observe the same increase in execution times. Is this something to do with the MPI implementation?

datguyray
  • I don't quite understand what you are doing - are you running an individual instance of the same program on each core separately? Also, first you are talking about cores (24) and then about processors - which is it? – MikeMB Feb 13 '16 at 18:07
  • Yes, I'm running an individual instance of the same program on each core. Sorry about the mix of words, I've edited it. – datguyray Feb 13 '16 at 18:20

3 Answers

4

Considering you are using ~2 GiB of memory per rank (two vectors of 130 million doubles each), your code is memory-bound. Apart from what the prefetchers can hide, you are not operating in the cache but in main memory. You simply saturate the memory bandwidth at a certain number of active cores.
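As a rough back-of-the-envelope check (the per-iteration byte count and the write-allocate assumption are mine, not from the question): each iteration reads vec2[i] and writes vec1[i], and with write-allocate caches the store also fetches the destination line first, so each rank moves roughly 3 × 8 bytes per iteration through main memory. A small standalone sketch of that arithmetic, using the reported timings:

#include <cstdio>

int main()
{
  const double n = 130000000.0;            // elements per vector
  const double bytes_per_iter = 3 * 8.0;   // load vec2[i], load + store vec1[i] (write-allocate)
  const double traffic_gb = n * bytes_per_iter / 1e9;

  const double t1  = 0.237;                // reported single-rank time (s)
  const double t16 = 0.454;                // reported per-rank time with 16 ranks (s)

  std::printf("traffic per rank: %.2f GB\n", traffic_gb);
  std::printf("1 rank:   %.1f GB/s\n", traffic_gb / t1);
  std::printf("16 ranks: %.1f GB/s aggregate\n", 16.0 * traffic_gb / t16);
  return 0;
}

With these numbers a single rank already streams on the order of 13 GB/s, and 16 ranks together demand roughly 110 GB/s, which is in the region where a two-socket node runs out of memory bandwidth.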

Another aspect can be turbo mode, if enabled. Turbo mode can raise the core frequency when fewer cores are utilized. As long as the memory bandwidth is not saturated, the higher turbo frequency increases the bandwidth each core gets. This paper discusses the available aggregate memory bandwidth on Haswell processors depending on the number of active cores and frequency (Fig. 7/8).
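If you want to see whether turbo actually kicks in during your runs, one option (a minimal sketch, assuming a Linux system that exposes the cpufreq sysfs interface) is to read the current frequency of a core while the benchmark is running:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
  // Current frequency of core 0 in kHz, as reported by the cpufreq subsystem.
  std::ifstream f("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
  std::string khz;
  if (f >> khz)
    std::cout << "cpu0 frequency: " << khz << " kHz\n";
  else
    std::cout << "cpufreq sysfs interface not available\n";
  return 0;
}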

Please note that this has nothing to do with MPI / OpenMPI. You might as well launch the same program X times by any other means.
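To convince yourself of that, the loop can be timed without MPI at all and the resulting binary simply started several times in parallel. A minimal sketch (std::chrono in place of MPI_Wtime, otherwise the same loop as in the question):

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
  const unsigned n = 130000000;
  std::vector<double> vec1(n, 1.0);
  std::vector<double> vec2(n, 1.0);

  const auto t1 = std::chrono::steady_clock::now();
  for (unsigned i = 0; i < n; i++)
  {
    // Same non-trivial loop body as in the question
    vec1[i] = vec2[i] + i;
  }
  const auto t2 = std::chrono::steady_clock::now();

  const double dt = std::chrono::duration<double>(t2 - t1).count();
  std::printf("loop time: %f s\n", dt);
  return 0;
}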

Zulan
  • Thank you! This is exactly it. I'll look through the paper. – datguyray Feb 13 '16 at 19:11
  • I've skimmed through the paper. Quick question, do you know if it's possible for me to change the frequency of the CPU in C++? I'm just wondering how they're able to vary the frequency in Fig. 7 and 8, or whether I would need admin access of some sort. – datguyray Feb 14 '16 at 01:59
  • @user302157, by default this does require root. It is possible to enable this for users by setting permissions for `/sys/devices/system/cpu/cpu*/cpufreq/scaling_{governor,setspeed}`. You can either write the files directly or use `libcpufreq` (a bit slow because it always double-checks the governor). – Zulan Feb 14 '16 at 07:03
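For illustration only, this is roughly what writing those files looks like from C++ (it assumes write permission has been granted as described in the comment above, that the kernel's cpufreq driver offers the userspace governor, and the 2.6 GHz value is just an example):

#include <fstream>

int main()
{
  // Select the userspace governor so that scaling_setspeed is honoured.
  std::ofstream gov("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
  gov << "userspace" << std::flush;

  // Request a fixed frequency for core 0; the value is in kHz (example: 2.6 GHz).
  std::ofstream speed("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed");
  speed << 2600000 << std::flush;
  return 0;
}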
2

I suspect that there are shared resources used by your program, so when the number of processes increases there are delays while a resource is freed so that it can be used by another process.

You see, you may have 24 cores, but that doesn't mean your system allows every core to do everything concurrently. As mentioned in the comments, memory access is one thing that might cause delays (due to traffic); the same goes for disk access.

Also consider the interconnection network, which can suffer from contention under many simultaneous accesses. In short, these hardware delays are enough to overwhelm the processing time.


General note: remember how the efficiency of a program is defined:

E = S/p, where S is the speedup and p the number of nodes/processes/threads

Now take scalability into account. Usually programs are weakly scalable, i.e. you have to increase the problem size at the same rate as p to keep the efficiency constant. A program that keeps its efficiency constant while only p is increased and the problem size (n in your case) is held fixed is strongly scalable.
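As a concrete illustration with the timings from the question (treating each run as a weak-scaling experiment, since every rank does the same fixed amount of work): the efficiency at p ranks is then roughly E(p) = T(1)/T(p), so E(16) ≈ 0.237/0.454 ≈ 0.52, i.e. only about half of the ideal, which is consistent with the bandwidth saturation described in the other answer.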

gsamaras
  • This could be it. I don't know much about RAM access, but I know that all the cores share the same RAM. I don't know if the access to this memory can be done concurrently. – datguyray Feb 13 '16 at 18:27
  • Exactly @user302157, you see it has to do with the interconnection network too. ;) Scan the answers and select one to accept (also check my update). – gsamaras Feb 13 '16 at 18:55
  • Thank you! Yes, I'm actually trying to determine the cause of poor scaling in weak scaling tests. There's actually parallel code before the loop presented which inserts different data into the vectors depending on the rank, but I'm analysing each part separately and I was confused about this part which should scale very well under weak scaling. But the cause of the poor scaling when going to 16 cores is due to bandwidth issues. I didn't think of this. Thanks again. – datguyray Feb 13 '16 at 19:10
  • You are welcome @user302157. I would try to increase n as p is increasing. If that doesn't work, post a new question. Nice question btw, +1. – gsamaras Feb 13 '16 at 19:12
  • you mentioned interconnection network, does this terminology normally apply to the network linking two different nodes together? What about two CPUs on the same node? Would this depend on the bus speed? Thank you again! (Also, why can't I tag you with the @ symbol... @gsamaras ) – datguyray Feb 15 '16 at 19:24
  • @user302157 you can't because I am notified anyway, since you are commenting under my answer. Yes, that's it. It depends on how these CPUs are connected and a bus is one way that they could be connected. – gsamaras Feb 16 '16 at 22:06
1

Your program is not using parallel processing at all. Just because you have compiled it with OpenMP does not make it parallel.

To parallelize the for loop, for example, you need to use one of the #pragmas OpenMP offers:

unsigned n = 130000000;
std::vector<double> vec1(n, 1.0);
std::vector<double> vec2(n, 1.0);
double t1, t2, dt;
t1 = MPI_Wtime();

#pragma omp parallel for
for (unsigned i = 0; i < n; i++) 
{
  // Do something so it's not a trivial loop
  vec1[i] = vec2[i]+i;
}
t2 = MPI_Wtime();
dt = t2-t1;
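Keep in mind that the #pragma above only has an effect if the program is compiled with OpenMP support enabled (for example with -fopenmp for GCC and Clang); otherwise it is silently ignored and the loop stays serial.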

However, take into account that for large values of n, the impact of cache misses may hide the performance gained with multiple cores.

mcleod_ideafix
  • Yes, I know that there is no parallel processing. I stated that there is no communication and the same identical work is done. So I don't know why there is an increase in execution time even when each core is doing exactly the same thing. I've run the program through callgrind and there is the same number of instruction/data misses on all cores. – datguyray Feb 13 '16 at 18:16