So you're only running about half as fast as you hoped, when scaling up to 29 parallel copies of your code?
Memory bandwidth could be an issue, with 29 copies of the same algorithm reading/writing their own memory at the same time. That's why, in a case like this, it could potentially be better (but much harder) to look for parallelism within a single iteration, so the cores share one working set instead of competing for memory with 29 separate ones.
Let's use video encoding as a specific example of what "one iteration" might be: encoding 29 videos in parallel is like what you're proposing. Having x264 use 32 cores to encode one video, then repeating that for the next 28 vids, uses much less total RAM and caches better.
In practice, maybe 2 or 3 vids in parallel, each using 10 to 16 threads, would be good, since there's a limit to how much parallelism x264 can find.
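As a rough sketch of that middle ground (assuming GNU parallel is installed and your sources are .y4m files; the job count, thread count, and filenames are made-up placeholders):

```bash
# Run at most 3 encodes at once; each x264 instance gets 12 threads.
# *.y4m and the output naming are placeholders for your real files.
parallel -j3 'x264 --threads 12 -o {.}.264 {}' ::: *.y4m
```

The same idea applies to any tool that can use a bounded number of threads per job: keep the total thread count near your core count instead of oversubscribing.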
It depends on the algorithm, and how well it scales with multiple threads. If it doesn't scale at all, or you don't have time to code it, then brute force all the way: a speedup factor of over 10 is nothing to sneeze at for basically no effort (e.g. running a single-threaded program on different data sets with make -j29 or GNU parallel, or in your case using multiple threads in a single program). :)
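For instance, the GNU parallel version of that brute-force approach might look like this (./crunch, data/, and out/ are hypothetical stand-ins for your single-threaded program and files):

```bash
# Launch up to 29 jobs at a time; each runs the single-threaded program
# on one input file and writes its own output file.
mkdir -p out
parallel -j29 './crunch {} > out/{/.}.txt' ::: data/*.dat
```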
While your code is running, you could check CPU utilization to make sure you're actually keeping 29 CPU cores busy like you're trying to. You could also use a profiling tool (like Linux perf) to investigate cache effects: if a parallel run has a lot more than 29 times the data-cache misses of a single-threaded run, that would start to explain things.
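Something along these lines would do it (a sketch; ./myprog and the run script are placeholders for however you actually launch things, and mpstat comes from the sysstat package):

```bash
# Per-core utilization, refreshed every second, while the parallel run is going:
mpstat -P ALL 1

# Compare cache misses for a single-threaded run vs. the full parallel run:
perf stat -e cache-references,cache-misses ./myprog one_dataset.dat
perf stat -e cache-references,cache-misses ./run_all_29_copies.sh
```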