
We have a high-availability NFS environment using DRBD, Heartbeat and NFS exposed to clients (similar to the following: https://help.ubuntu.com/community/HighlyAvailableNFS ). This seems to be a rather common and well-supported method of doing HA NFS, and it is working really well for us with one exception.

When Heartbeat performs the switchover, the NFS clients all hang for approximately 60-120 seconds. I can see that it only takes 5-10 seconds for Heartbeat to complete the takeover and bring NFS back up (I can even mount the export manually at that point), but the already-connected clients seem to wait for some sort of timeout before they re-establish a working connection.

I've tried the following without success:

  • Ensured that /var/lib/nfs is stored on the DRBD disk and symlinked back to /var/lib
  • Both UDP and TCP client connections
  • Defining fsid= for the export on the NFS server (example export line after this list)
  • Playing with the client timeo= mount option
  • Hard and soft mounts
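
For reference, a pinned-fsid export line looks something like the sketch below (the export path /export and the client subnet are placeholders, not my actual values); pinning fsid= is meant to keep file handles stable across the takeover, which is why it is on the list above.

    # /etc/exports on the active DRBD/Heartbeat node
    # /export and 192.168.0.0/24 are placeholders for the real exported
    # directory (on the DRBD volume) and the client network;
    # fsid=0 pins the filesystem ID so both nodes present identical handles
    /export    192.168.0.0/24(rw,sync,no_subtree_check,fsid=0)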

The setup is as follows:

  • NFSv4
  • Ubuntu LTS servers and clients
  • Current client mount options: proto=tcp,noauto,bg,intr,hard,noatime,nodiratime,nosuid,noexec (see the example fstab entry below)
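
As a concrete illustration, the client-side /etc/fstab entry looks roughly like the line below (the server name and mount point are placeholders; the path "/" is relative to the fsid=0 pseudo-root, and timeo= is shown only as one of the values I experimented with):

    # /etc/fstab on a client; "nfs-server" and /mnt/data are placeholders
    # timeo= is in tenths of a second and was only set while testing
    nfs-server:/   /mnt/data   nfs4   proto=tcp,noauto,bg,intr,hard,noatime,nodiratime,nosuid,noexec,timeo=50   0   0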

Notes

  • I've noticed that /var/lib/nfs/rmtab is always empty and I can't work out why. Could this be the reason?
  • Clients are GUI-less Ubuntu 10.04 LAMP-stack servers.
  • When a client stalls, any program that tries to access the share stalls as well. For example, running "df" will hang the SSH session at the NFS mount line until NFS comes back.

Any advice would be most welcome.

leenix

1 Answer


If you're running Ubuntu with GUI user logins such as LTSP, it's very possible that the clients are the problem.

The Gnome-Settings-Daemon has a nasty habit of digging around inside the NFS mounts to check the state of any trash folders it finds. This problem exists in Ubuntu 9.10 and likely also in 10.04.
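
One way to confirm this is to check which local processes are holding files open under the NFS mount point while the mount is still healthy. A quick sketch, assuming a mount point of /mnt/data (substitute the real path):

    # Show processes with open files under the mount point
    fuser -vm /mnt/data

    # Same idea with more detail (can be slow on large trees)
    lsof +D /mnt/data

If gnome-settings-daemon shows up in that output on a machine that is supposed to be headless, it is a likely culprit.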

This behaviour is hard-coded in the Ubuntu distribution; the fix for it was erroneously dropped in the 9.x releases but is reported to be restored in later Ubuntu releases. A common symptom is a high load average while the NFS mounts are unreachable.

Magellan
  • Thanks for the suggestion. In this case, the clients are GUI-less. CPU usage on the clients is normal during the switchover, but the number of processes grows massively as each process waits for NFS to come back so it can serve its request. – leenix Dec 13 '11 at 10:18
  • Yeah, CPU never increases with ours either. Latency does as the load average increases at a rate of sessions * broken mounts because the GSD NFS requests are stacking up. – Magellan Dec 13 '11 at 19:11
  • AdrianK: With your setup, when a failover happens, how long would you estimate it takes for a client to be able to use the mount normally? – leenix Dec 13 '11 at 20:53
  • This is the part that stinks. For our installation, it actually ended up being (Load Average * 3 minutes / # of mounts) where Load Average is Sessions * Mounts. – Magellan Dec 13 '11 at 21:13
  • Ouch, you almost made me feel like I got it good. From what I've read, some people seem to have very minimal delay with my use case (clients are just LAMP stacks), but I just can't seem to find the issue with ours. – leenix Dec 13 '11 at 21:28