HPC / EC2 - optimizing NFS for reliability

Question

In AWS EC2, I've set up a cluster of Linux virtual machines consisting of an NFS file server and many clients. When the number of clients exceeds ~20 and I/O is heavy, I see loss of file integrity: e.g. gzipped files written by a client to the server are corrupted.

I am wondering what the best set of NFS parameters is to increase the reliability of data transfer in this environment.

For now the mount flags are:

Flags:  rw,vers=3,rsize=262144,wsize=262144,hard,proto=tcp,timeo=600,retrans=2

The MTU size is 1500, and the number of NFS daemons is 8.

Should I decrease rsize & wsize below the MTU, and increase the number of NFS daemons?

Is there anything else that can be improved?

Many thanks.

@mark-wagner. Thanks Mark. This cluster performs mathematical analyses. Results from each client are written into the same folder, but into separate files. There is no overlap. — Olivier Delrieu, Sep 13 '13 at 15:51
I know this is old, but had you checked whether disk access was via async? Try sync in exports, i.e.: /sharedfolder *(rw,no_root_squash,sync) — B. Shea, Nov 12 '15 at 19:02

score 0 · Answer 1 · answered Jul 15 '15 at 15:45

For a cluster of this size, it may be a good idea to consider moving to a parallel file system like GlusterFS. Alternatively, if the cluster is configured correctly, every node should be able to resolve every other node, either via DNS or by lookups in /etc/hosts, and should have the appropriate SSH keys to access them without a password.

If this is the case, each node could write its output locally and simply copy the files to the server once computation/compression is complete, which would remove the need for NFS. While this solution will probably not give optimal performance, it may be a good option, depending on how the compute nodes, network and storage are virtualized.
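As a rough illustration of the copy-on-completion idea, here is a minimal Python sketch that copies a finished file and checks that it arrived intact. The helper names (sha256_of, gzip_is_intact, copy_verified) are made up for this example, and shutil.copyfile is only a local stand-in: on a real cluster the transfer would presumably go through scp or rsync, with the second checksum computed on the receiving node.

```python
import gzip
import hashlib
import shutil
import zlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_intact(path: Path, chunk_size: int = 1 << 20) -> bool:
    """Decompress the whole file; gzip validates its CRC-32 at end of stream."""
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # Bad header, failed CRC, truncated stream or corrupt deflate data.
        return False


def copy_verified(src: Path, dst: Path) -> str:
    """Copy src to dst and confirm that both files hash identically.

    Assumption: shutil.copyfile stands in for the real transfer step
    (scp/rsync to the destination node).
    """
    expected = sha256_of(src)
    shutil.copyfile(src, dst)
    if sha256_of(dst) != expected:
        raise IOError(f"checksum mismatch after copying {src} to {dst}")
    return expected
```

A client would call copy_verified(local_result, destination) after closing its output file, and could run gzip_is_intact on the destination as a second check. The same integrity test could also be used on the existing NFS setup to detect which writes are being corrupted.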

What sort of cluster management/provisioning system are you using? Normally, appropriate shared storage for the compute nodes is set up during the setup of the head node. Using a tool like Warewulf or ROCKS might help to ensure that compute nodes are provisioned correctly, and there are many guides and reference designs available online for setting up clusters with these tools.
