Our legacy production system has a data server and a compute client that interact via NFSv4; both run on VMWare virtualization.

We have determined that as the compute client's software runs, the number of DELEG locks on the data server (as measured with lslocks | wc) grows over the course of days.
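For completeness, a rough way to count just the delegations (rather than every lock type) on the data server, assuming the lock type is reported as DELEG both by lslocks and in /proc/locks:

lslocks | grep -c DELEG
grep -c DELEG /proc/locks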

The number of DELEG READ locks can transiently spike to 100,000; over the longer term, the number of DELEG locks increases. There are practically no DELEG WRITE locks. Once the number of DELEG locks on the data server hits about 15,000, the compute client becomes barely responsive; most processes are in "D" state. The load average stays moderate, but CPU utilization drops from about 90% to 10% or less.
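When the client gets into this state, a quick way to see which processes are stuck in uninterruptible sleep and what they are waiting on (assuming a standard procps ps) is:

ps -eo state,pid,wchan:32,cmd | awk '$1 ~ /^D/'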

Killing the jobs on the compute client does not reduce the lock count, even after several minutes. Shutting down or rebooting from the command line typically hangs; unmounting the NFS filesystems and /tmp can fail.

A hard reset often results in a failure to mount the NFS shares on reboot, but the lock count on the data server drops dramatically. A hard power-down followed by a two-minute wait allows the VM to be brought back up with all NFS mounts intact.

The data server is running SLES 12 (4.1.16-2.g1a0d915-default); it has 12 vCPUs and 128 GB of RAM and makes heavy use of a mix of SSD- and HDD-backed SAN storage. The data server's version of nfs-kernel-server is 1.3.0-9.1. The data server generates about 100k data files each day. Some of these files are overwritten and renamed by the server itself, but in such a way that either the primary file or a transient copy of it should always be available. The client code is aware of this and fails over if needed.

The compute client is running Debian Stable Stretch (4.9.0-6-amd64). It has 24 vCPUs and 128 GB of RAM. The nfs-common client version is 1:1.3.4-2.1. Via NFS, it reads these data files, both the active set of 100k files generated daily and historically produced files. A single file may be read by many processes at the same time. Most files are accessed by small bash scripts that read 1 or 2 files at a time and reshape the data for upstream consumption. These scripts are called by Python and Perl programs with lifetimes on the order of hours.

The data server has exports like:

/mnt/ssd  172.26.188.199(rw,wdelay,root_squash,no_subtree_check,sec=sys,rw,secure,root_squash,no_all_squash)

The compute client has fstab entries like:

172.26.188.198:/mnt/ssd           /mnt/ssd        nfs     defaults        3 3

The compute client has a more straightforward architecture than the data server. An earlier VM instance running SLES experienced similar problems; this prompted a complete overhaul of the OS and compute jobs, which reduced but did not eliminate the problem.

The next obvious step is to port the existing data server to a new OS, but this is a heavy lift with uncertain payoff. Because the behavior spans two different compute client boxes running different Linux versions and different software, we suspect the problem must lie in one of the following: the server OS or NFS version; the high-frequency, overlapping access pattern of the client; the file rewrite pattern of the data server (perhaps if the NFS server role were split out of the data server entirely, the internal nfsd state would be corrected); or a bug in NFS or an error in how we have configured it. I am also not willing to dismiss possible issues in our Ethernet or SAN, but consider these unlikely at present.

Besides an overhaul of the server, switching to a distributed file system, or setting up a regular reboot interval, what short-term fixes might help keep the DELEG lock count low on our data server?

1 Answer


To address this, we tried changing file servers (still NFSv4), hardware, and kernel versions (now Linux 4.2.8 on a NAS, up from 4.1.16 on x86/VMWare/SLES). Despite these changes, the accumulation of NFSv4 DELEG locks still occurred -- presumably the problem is rooted in a software structure that worked fine on purely local (direct-attached) storage. Ultimately, we abandoned NFSv4 and now use NFSv3 with the nolock option; we also use noatime,nodiratime,noacl for performance reasons. Locks no longer accumulate, and the catastrophic slowdown seen before is gone.
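For anyone hitting the same problem, a sketch of the kind of client fstab entry we converged on (the server address and mount point are just the ones from the question; the exact option set may need tuning for your environment):

172.26.188.198:/mnt/ssd           /mnt/ssd        nfs     vers=3,nolock,noatime,nodiratime,noacl        0 0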