
I've got a number of servers used for HPC / cluster computing, and I've noticed that because some of the computations they run use huge files over NFS, this causes significant bottlenecks. I'm wondering how to address the issue.

The setup:

  • 34 servers running Debian Squeeze (42 GB RAM each)
  • 12 physical cores per machine + HT
  • 2 "head" machines (head1 and head2) with 500 GB drives each
  • 32 "slave" machines which do PXE boot from head1
  • head1 exports the NFS file system for the 32 PXE servers
  • head2 exports a "data" directory via NFS which contains data files for all the other machines
  • the "data" directory contains very large files (5+ GB)
  • connectivity between the machines: Gigabit Ethernet
  • most machines are not in the same physical rack
  • the cluster uses Open Grid Scheduler (aka Grid Engine) for batch job processing

One of the computations this cluster runs involves each "slave" reading a very large set of files (3 GB + 3 GB + 1.5 GB + 750 MB) before starting the various calculations. I've noticed that when this happens, most of the slaves spend significant time (several minutes) just reading these files, while the actual computation is much faster.

Currently, I've raised the number of NFS daemon threads on head2 and set rsize and wsize to 32k in the slaves' mount options, but it's still a significant bottleneck.
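For reference, the tuning amounts to something like the following (a sketch; the file paths are Debian's, and the thread count shown is an example value):

```shell
# On head2: raise the number of nfsd threads. On Debian this is set in
# /etc/default/nfs-kernel-server, e.g.:
#   RPCNFSDCOUNT=32
# then restart the NFS server:
/etc/init.d/nfs-kernel-server restart

# On each slave: mount the data export with 32k read/write sizes:
mount -t nfs -o rsize=32768,wsize=32768 head2:/data /data
```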

What can I do to improve performance? Should I have the slaves host these files on their own hard disks instead, or should I go with an entirely different filesystem for storage?

Einar
  • Are these nodes working with the same data set or different sets at a given time? I haven't worked with this kind of clustering myself but a friend worked with map topology data and using multicast to distribute his data helped enormously. – Tim Brigham Feb 19 '13 at 12:24

3 Answers


What you are encountering here is most likely not a limitation of NFS.

Also take into account that those 5 GB take, at the very least, 40 s to transfer at gigabit wire speed, for each client. You have 32 of them hammering head2, and they are unlikely to request the same blocks at the same time. Add Ethernet, TCP/UDP and NFS overhead, and you'll soon reach the minutes you described.
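A quick sanity check of that floor (a sketch; 125 MB/s is the raw 1 Gbit/s rate, before any protocol overhead):

```shell
# 5 GB = 5000 MB shipped at 125 MB/s (1 Gbit/s) takes at least:
awk 'BEGIN { printf "%d s\n", 5000 / 125 }'
# prints: 40 s
```

With 32 clients contending for the same link and spindles, the per-client time only goes up from there.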

So, before you try to replace NFS with anything else (yes, there are protocols with less overhead), check each part of the path the data takes (starting at the disk subsystem) for possible bottlenecks. Benchmark if in doubt.

Removing those bottlenecks (if any) with additional or better hardware will be easier than changing your whole software setup.

Roman

Since you are doing performance analysis, the first question should be: "What data am I basing this assumption on? Are there network traces or other performance measurements that would support this hypothesis?"

There are a lot of possible bottlenecks in such a system, and I would question the choice of the network filesystem last, especially since you do not appear to write significant amounts of data; locking / concurrency and the accompanying latency issues would be the most likely causes of an NFS bottleneck.

On the other hand, 32 concurrent requests for 8 GB of data each are likely to overload any single SATA disk, given the rather limited IOPS rating of a single spindle. A simple calculation assuming a read block size of 64 KB per request and 100 IOPS for the disk yields a rate of just 6.4 MB/s for random reads, which is what you will be getting with that number of simultaneous readers unless you cache the data heavily.
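The arithmetic behind that figure, as a one-liner (the 64 KB block size and 100 IOPS are the assumptions stated above):

```shell
# 100 requests/s * 64 KB per request, expressed in MB/s (1 MB = 1000 KB):
awk 'BEGIN { printf "%.1f MB/s\n", 100 * 64 / 1000 }'
# prints: 6.4 MB/s
```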

You should take a good look at the performance indicators provided by iostat to see whether your disk is being overloaded. If it is, take appropriate measures (e.g. get a storage subsystem capable of coping with the load) to remedy the situation.
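For example, extended device statistics at five-second intervals (on Debian, iostat ships in the sysstat package):

```shell
# Watch %util (device saturation) and await (average I/O wait, ms);
# sustained %util near 100 with high await means the disk is the bottleneck.
iostat -x 5
```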

the-wabbit

I have an environment that is quite similar (lots of blade servers as worker nodes, and huge files, each several GB or even TB in size). I use the Hadoop Distributed File System (HDFS). Check out:

http://en.wikipedia.org/wiki/Hadoop_Distributed_File_System#Hadoop_Distributed_File_System

http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf

You might find it a bit more complex to set up than NFS though.

Thomas
  • I have a prototype set up elsewhere for Hadoop, indeed it's a bit inconvenient. I also wonder if it's feasible, given that I have access only to the disks in head1 and head2. – Einar Feb 19 '13 at 09:02
  • Oh, I assumed that the disks in the worker nodes could be utilized as well. If you can somehow utilize even small disks in the worker nodes, you can get amazing performance with HDFS. – Thomas Feb 19 '13 at 09:06