Our NFS-shared file-system is locking up.

Please feel free to ask any questions you feel relevant. :)

When this happens, a lot of processes are in the "disk sleep" state and the load averages on our machines skyrocket. The machines remain responsive over SSH, but the majority of our websites (apache+mod_php) just hang, as does our email system (exim+dovecot). Any websites which don't require write access to the file-system continue to operate.

The load averages continue to rise until some kind of time-out is reached, which takes at least 10-15 minutes. I've seen load averages over 800, yet the machines are still responsive for actions which don't require writing to the shared file-system.

I've investigated a variety of possible causes, which have all turned out to be red herrings: nagios, proftpd, bind, cron tasks.

I'm seeing these messages in the file server's system log:

Jul 30 09:37:17 fs0 kernel: [1810036.560046] statd: server localhost not responding, timed out
Jul 30 09:37:17 fs0 kernel: [1810036.560053] nsm_mon_unmon: rpc failed, status=-5
Jul 30 09:37:17 fs0 kernel: [1810036.560064] lockd: cannot monitor node2
Jul 30 09:38:22 fs0 kernel: [1810101.384027] statd: server localhost not responding, timed out
Jul 30 09:38:22 fs0 kernel: [1810101.384033] nsm_mon_unmon: rpc failed, status=-5
Jul 30 09:38:22 fs0 kernel: [1810101.384044] lockd: cannot monitor node0
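To see which clients are affected and how often, log lines like the ones above could be tallied with a small script. This is only a rough sketch: the function name count_unmonitored_clients is made up for illustration, and it assumes the "lockd: cannot monitor <host>" wording shown in the excerpt.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   Jul 30 09:37:17 fs0 kernel: [1810036.560064] lockd: cannot monitor node2
# The wording is taken from the excerpt above; other kernels may differ.
LOCKD_RE = re.compile(r"lockd: cannot monitor (\S+)")


def count_unmonitored_clients(log_text):
    """Return a Counter mapping client name -> number of lockd failures."""
    return Counter(match.group(1) for match in LOCKD_RE.finditer(log_text))


sample_log = (
    "Jul 30 09:37:17 fs0 kernel: [1810036.560064] lockd: cannot monitor node2\n"
    "Jul 30 09:38:22 fs0 kernel: [1810101.384044] lockd: cannot monitor node0\n"
)
for client, failures in sorted(count_unmonitored_clients(sample_log).items()):
    print(client, failures)
```

Feeding it the contents of the real system log instead of sample_log would show whether every client is hit or only a few.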

Software involved:

VMware, Debian lenny (64-bit), an ancient Red Hat (32-bit, version 7 I believe), Debian etch (32-bit)

NFS, apache2+mod_php, exim, dovecot, bind, amanda, proftpd, nagios, cacti, drbd, heartbeat, keepalived, LVS, cron, ssmtp, NIS, svn, puppet, memcache, mysql, postgres

Joomla!, Magento, Typo3, Midgard, Symfony, custom php apps

Mark Henderson
fredden

1 Answer

In that case, try remounting the NFS partition. Is it exported with or without sync?

Nikolaidis Fotis
  • It's exported async: /home 10.0.17.0/24(rw,no_root_squash,async,no_subtree_check) – fredden Aug 01 '10 at 21:03
  • OK. And what are the mount parameters? Try mounting with -o nolock. Also check the NFS server's log; probably a service is dead. The ps aux output from the server would also be very helpful. If you don't want to publish it, check whether the rpc.statd and portmap services are running. – Nikolaidis Fotis Aug 02 '10 at 13:02
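
One way to check whether rpc.statd and portmap are registered is to look at the output of rpcinfo -p on the file server. The sketch below parses that output and reports which services are missing. The helper names are made up for illustration, and it assumes the usual five-column rpcinfo -p layout, where rpc.statd appears as "status", lockd as "nlockmgr" and portmap as "portmapper".

```python
import subprocess

# Assumed service names as they usually appear in `rpcinfo -p` output.
REQUIRED_SERVICES = ("portmapper", "status", "nlockmgr")


def registered_rpc_services(rpcinfo_output):
    """Collect service names from `rpcinfo -p` style text."""
    services = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data rows look like: program vers proto port service
        if len(fields) >= 5 and fields[0].isdigit():
            services.add(fields[4])
    return services


def missing_rpc_services(rpcinfo_output, required=REQUIRED_SERVICES):
    """Return the required services that are not registered, in order."""
    found = registered_rpc_services(rpcinfo_output)
    return [name for name in required if name not in found]


def query_rpcinfo(host="localhost"):
    """Run `rpcinfo -p HOST` and return its stdout (empty on failure)."""
    result = subprocess.run(
        ["rpcinfo", "-p", host], capture_output=True, text=True, check=False
    )
    return result.stdout


# Hand-written sample in which rpc.statd ("status") is not registered.
sample_output = """\
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100021    1   udp  32769  nlockmgr
    100003    3   udp   2049  nfs
"""
print(missing_rpc_services(sample_output))
```

On the real server, missing_rpc_services(query_rpcinfo()) would give the same report for the live portmapper. An empty list only means the services are registered, not that they are responding.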