When looking at performance on a scientific cluster, here are some of the main bottlenecks I see:
- What kind of networking do you have? Yes, you say you have gigabit Ethernet, but are you using non-blocking switches so that every node on the switch can get full line rate?
- Are you using a distributed file system or an optimized NAS?
- Are all your links negotiating at full line rate? Again, this goes back to the first point, but you'd be surprised what you'll find by running iperf on the nodes occasionally (a quick ethtool check is sketched after this list).
- What is your latency? This can become a problem on gigabit networks when you have network issues, and it can really put a damper on applications that rely on MPI (see the ping example after this list).
- What are the settings for your main interface in network-scripts? Is your MTU set to 9000? (A sample config is sketched after this list.)
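To quickly confirm what rate a link actually negotiated, ethtool is usually enough; eth0 below is just an assumed interface name:
# Show the negotiated speed and duplex for the interface
/sbin/ethtool eth0 | grep -E 'Speed|Duplex'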
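For a rough latency check between two nodes, plain ping already gives you min/avg/max round-trip times; node02 here is a placeholder hostname:
# Send 10 pings and read the rtt min/avg/max summary at the end
ping -c 10 node02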
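As a rough sketch of what an ifcfg file under network-scripts might look like with jumbo frames enabled (the device name and addressing are placeholders, not your actual config):
# /etc/sysconfig/network-scripts/ifcfg-eth0 (example only)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
# placeholder addressing
IPADDR=10.0.0.11
NETMASK=255.255.255.0
# jumbo frames; the switch ports must allow this as well
MTU=9000
After restarting the interface, ifconfig eth0 (or ip link show eth0) will report the MTU that is actually in effect.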
Iperf can normally be found on RHEL systems at
/apps/rhel5/iperf/bin/iperf
To run iperf, first set up a server on a node:
/apps/rhel5/iperf/bin/iperf -s
Then, from the node whose link you wish to test, run:
/apps/rhel5/iperf/bin/iperf -c <host or IP of server>
If successful, you will see output like this on the client:
------------------------------------------------------------
Client connecting to <host or IP of server>, TCP port 4200
TCP window size: 256 KByte (default)
------------------------------------------------------------
[ 3] local 123.11.123.12 port 4400 connected with 123.456.789.12 port 4200
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.1 GBytes 1.01 Gbits/sec
If you don't have it installed, it is available from the repositories of many platforms, and it is free to download and compile from source if not. Run this against each node one at a time to see if there is a problem with the actual Ethernet cabling. After that, run it from all the nodes at once to see whether the switch bogs down (a simple loop for this is sketched below).
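As a minimal sketch of that last test, assuming passwordless ssh to the compute nodes and that the iperf server from above is still running on one node (head01 and node01-node04 are placeholder names), starting all the clients in the background hits the switch at roughly the same time:
# Start an iperf client on every compute node at once, each running for 30 seconds
for node in node01 node02 node03 node04; do
    ssh $node "/apps/rhel5/iperf/bin/iperf -c head01 -t 30" &
done
wait
If the per-node bandwidth drops well below what each node saw when tested alone, the switch (or its uplink) is the bottleneck.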