
I am setting up a supercomputing Linux cluster at work. We ran the most recent HPCC benchmarks using OpenMPI and GotoBLAS2 but got really bad results. When I ran the benchmarks with one process for every core in the cluster, the results were much worse (more than 100X worse) than running the benchmark in a single process. This is clearly not the kind of performance we expected. My only thought is that MPI is taking too long to transfer messages among the processes. Does anybody have any ideas about how I can optimize the server setup so that the performance doesn't suck so much?
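
For reference, the launch looked something like this (the hostfile name and binary path here are just placeholders, not our exact invocation):

mpirun -np 128 --hostfile machines ./hpcc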

We are using the Rocks cluster distribution with OpenMPI v1.4.3. Our compute nodes are Dell rack-mount servers with two quad-core Intel Xeon processors each. They are connected by gigabit ethernet cables.

Zhehao Mao

1 Answer


When looking at performance on a scientific cluster, here are some of the main bottlenecks I see:

  • What kind of networking do you have? Yes, you say you have gigabit Ethernet, but are you using non-blocking switches so that every node on the switch can get full line rate?
  • Are you using a distributed file system or an optimized NAS?
  • Are all your links running at full line rate? Again, this goes back to the first point, but you'd be surprised at what you'll find by running iperf on nodes every so often.
  • What is your latency? This can become a problem from time to time on gigabit networks if you have network issues, and it can really put a damper on applications that rely on MPI.
  • What are the settings for your main interface in network-scripts? Is your MTU set to 9000? (See the sketch just below this list.)
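
A rough sketch of the jumbo-frame piece, assuming a RHEL/Rocks-style node whose private interface is eth0 (adjust the device name for your setup, and note that the switch and every node must support jumbo frames, otherwise leave the MTU at 1500):

/etc/sysconfig/network-scripts/ifcfg-eth0:

DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
# jumbo frames for the private MPI/storage network
MTU=9000

After a network restart you can verify the interface MTU and that large frames actually make it end to end:

ip link show eth0
ping -c 5 -M do -s 8972 <another compute node>

(8972 bytes of payload plus 28 bytes of headers is a full 9000-byte frame; if those pings fail, something in the path is not passing jumbo frames.)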

Iperf can normally be found on RHEL systems at

/apps/rhel5/iperf/bin/iperf

To run iperf, first set up a server on a node:

/apps/rhel5/iperf/bin/iperf -s

Then, from the node whose link you want to test, run:

/apps/rhel5/iperf/bin/iperf -c <host or IP of server>

If successful you will see output like this on the client:

------------------------------------------------------------
Client connecting to <host or IP of server>, TCP port 4200
TCP window size:   256 KByte (default)
------------------------------------------------------------
[  3] local 123.11.123.12 port 4400 connected with 123.11.123.45 port 4200
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.1 GBytes  1.01 Gbits/sec

If you don't have it installed, it can easily be retrieved from the package repositories on many platforms, and it is freely available to download and compile from source otherwise. Run this against each node to see if there is a problem with the actual Ethernet cable. After that, run it on all the nodes at once to see if it bogs down the switch (a sketch of how follows below).
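
A minimal sketch of that all-at-once test, assuming passwordless SSH and the default Rocks hostnames compute-0-0 through compute-0-15 (adjust the names, count, and iperf path for your cluster). It pairs the nodes up so the switch has to carry eight full-rate streams at the same time:

# start an iperf server on the second half of the nodes (daemonized so ssh returns)
for i in {0..7}; do
    ssh compute-0-$((i+8)) "/apps/rhel5/iperf/bin/iperf -s -D"
done

# drive traffic from the first half, all at once, and wait for the clients to finish
for i in {0..7}; do
    ssh compute-0-$i "/apps/rhel5/iperf/bin/iperf -c compute-0-$((i+8)) -t 20" &
done
wait

# clean up the servers afterwards
for i in {0..7}; do
    ssh compute-0-$((i+8)) "pkill iperf"
done

If every pair still reports roughly 900+ Mbits/sec, the switch backplane is keeping up; if the numbers collapse when everything runs together, the switch (or its cabling) is the place to look.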

Wilshire
  • Wilshire, thanks for your answer. I didn't set up the networking (it was set up before I started) so I'm not really sure about the specifics. I'm pretty sure it is a latency issue but I am not certain how to fix that. I'll talk to my boss about the networking and see if he can answer your questions. – Zhehao Mao Jun 20 '11 at 18:45
  • As for the results of the HPCC benchmark. I've put them [here](https://gist.github.com/a9675d06a5ae64bce801). I ran the benchmarks using 128 processes on 16 8-core compute nodes. – Zhehao Mao Jun 20 '11 at 18:48
  • Just browsing the HPL output, stuff looked really off. You should be getting over 700 Gflops on 128 processors. – Wilshire Jun 20 '11 at 19:10
  • Yeah, I know, that's what my boss said. Instead the performance goes in the opposite direction. With just one processor, it gets 8 Gflops, but with 128 processors it gets only 30 Mflops. – Zhehao Mao Jun 20 '11 at 19:18
  • Yeah, if this is the case there might be something wrong with the network. Run iperf on all the nodes to check for bad cabling. – Wilshire Jun 20 '11 at 19:22
  • Where is iperf located? It doesn't seem to be on my path. – Zhehao Mao Jun 20 '11 at 19:24
  • Added the instructions and where to find iperf in the answer. – Wilshire Jun 20 '11 at 19:48
  • As for the network setup: the private network uses Gigabit Ethernet, dedicated to the 16 compute nodes and the head node. It uses a Netgear JGS524 switch. – Zhehao Mao Jun 20 '11 at 20:05
  • I ran iperf and none of the compute nodes seem to have an issue (they were all over 900 Mbits/sec). Do you have any other suggestions? – Zhehao Mao Jun 22 '11 at 16:01
  • Can you set up a script that would run iperf on all the machines at the same time, to see whether it's a switch problem? – Wilshire Jun 22 '11 at 17:41
  • You were right, when iperf was run on all the nodes at once, the bandwidth dropped down to about 50 Mbits/sec. – Zhehao Mao Jun 22 '11 at 17:48