2

When I view top on one of our servers there are a lot of nfsd processes consuming CPU:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
2769  root      20   0     0    0    0 R   20  0.0   2073:14 nfsd
2774  root      20   0     0    0    0 S   19  0.0   2058:44 nfsd
2767  root      20   0     0    0    0 S   18  0.0   2092:54 nfsd
2768  root      20   0     0    0    0 S   18  0.0   2076:56 nfsd
2771  root      20   0     0    0    0 S   17  0.0   2094:25 nfsd
2773  root      20   0     0    0    0 S   14  0.0   2091:34 nfsd
2772  root      20   0     0    0    0 S   14  0.0   2083:43 nfsd
2770  root      20   0     0    0    0 S   12  0.0   2077:59 nfsd

How do I find out what these are actually doing? Can I see a list of files being accessed by each PID, or any more info?

We're on Ubuntu Server 12.04.

I tried nfsstat but it's not giving me much useful info about what's actually going on.
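For what it's worth, the server-side counters can be checked with something like this (options may vary slightly between nfsstat versions):

nfsstat -s                     # per-operation server-side counters
watch -d 'nfsstat -s -o all'   # watch which counters are actually moving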

Edit - Additional stuff tried based on comments/answers:

Running lsof -p against each of the PIDs (2774 shown here) gives the following:

COMMAND  PID USER   FD      TYPE DEVICE SIZE/OFF NODE NAME
nfsd    2774 root  cwd       DIR    8,1     4096    2 /
nfsd    2774 root  rtd       DIR    8,1     4096    2 /
nfsd    2774 root  txt   unknown                      /proc/2774/exe

Does that mean no files are being accessed?


When I try to trace a process with strace -f -p 2774, it gives me this error:

attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
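For reference, the setting that message mentions can be inspected (and relaxed) with something like the following, though as I understand it the nfsd PIDs are kernel threads, so ptrace cannot attach to them regardless:

cat /proc/sys/kernel/yama/ptrace_scope      # 0 = classic ptrace permissions, 1 = restricted
sudo sysctl -w kernel.yama.ptrace_scope=0   # relax the restriction; does not help with kernel threads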

Running tcpdump | grep nfs shows tons of activity between two of our servers over NFS, but as far as I'm aware there shouldn't be any. A lot of entries like:

13:56:41.120020 IP 192.168.0.20.nfs > 192.168.0.21.729: Flags [.], ack 4282288820, win 32833, options [nop,nop,TS val 627282027 ecr 263985319,nop,nop,sack 3 {4282317780:4282319228}{4282297508:4282298956}{4282290268:4282291716}], len
BT643
    In this kind of situation I often found very useful to capture the NFS traffic (e.g., with `tcpdump` or Wireshark) and have a look at it to see if there is a specific reason for the high load. – Ale Nov 19 '14 at 13:20
  • Interesting... a `tcpdump | grep nfs` is showing tons of activity between two of our servers, over nfs, but as far as I'm aware they shouldn't be. A lot of entries like: `13:56:41.120020 IP 192.168.0.20.nfs > 192.168.0.21.729: Flags [.], ack 4282288820, win 32833, options [nop,nop,TS val 627282027 ecr 263985319,nop,nop,sack 3 {4282317780:4282319228}{4282297508:4282298956}{4282290268:4282291716}], len` – BT643 Nov 19 '14 at 13:58
  • you can use something like `tcpdump -w filename.cap "port 2049"` to save only NFS traffic (being on port 2049) to a capture file, then you can open that file on a PC with Wireshark and analyze it more in detail -- the last time I had a similar problem, it was a bunch of computation jobs from the same user who was over disk quota, and the clients (18 different machines) were trying over and over to write, raising the load on the old NFS server very high – Ale Nov 19 '14 at 14:21
  • Answer posted :) I'm glad you solved the problem, NFS can be very tricky to debug! Especially when there is lot of activity but no actual disk access (like my over quota user). – Ale Nov 19 '14 at 16:36

3 Answers

4

In this kind of situation I have often found it very useful to capture the NFS traffic (e.g., with tcpdump or Wireshark) and have a look at it to see if there is a specific reason for the high load.

For example, you can use something like:

tcpdump -w filename.cap "port 2049"

to save only NFS traffic (which runs on port 2049) to a capture file; you can then open that file on a PC with Wireshark and analyze it in more detail. The last time I had a similar problem, it was a bunch of computation jobs from the same user, who was over disk quota: the clients (18 different machines) were trying over and over to write, driving the load on the old NFS server very high.
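If you want a quick first look without a GUI, you can also read the capture back with tcpdump itself and count which hosts are generating most of the packets (filename as above, just a sketch):

tcpdump -n -r filename.cap | awk '{print $3}' | sort | uniq -c | sort -rn | head   # packets per source host.port

A single client, or one client port retrying the same operation over and over, usually stands out immediately.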

Ale
  • Sorry, thought I'd move my comment here before I realised you'd replied :) Thanks! I was able to track down the cause with tcpdump! It was caused by a stuck PHP script which happened to be accessing an NFS share on our second server. I don't think it was actually doing anything which is why it didn't really show in top, iotop, etc, but the amount of stuck processes on that mount seemed to be causing issues :) Thanks again! – BT643 Nov 19 '14 at 17:29
3

A couple of tools for you (example invocations below):

  • lsof shows you the open file handles
  • iotop shows per-process I/O statistics in a top-like manner
  • nethogs shows you per-process network traffic
  • strace lets you see which system calls a process is making
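Example invocations (the interface name and PID are just placeholders):

lsof -p 2774            # open file handles of one process
sudo iotop -o -P        # only show processes that are actually doing I/O
sudo nethogs eth0       # per-process traffic on a given interface
sudo strace -f -p 2774  # system calls made by a process and its forked children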
Janne Pikkarainen
  • Thanks! I've updated my original post with the output of a few of these. It's weird, as far as I can see with all of these, nothing much is happening nfs-wise.. so I have no idea why it's using CPU. – BT643 Nov 19 '14 at 13:48
0

Another useful tool is strace - it will show all the system calls (file accesses etc.) that a process (and its forked children) is making. For example:

[root@localhost ~]# strace -f -p 2770

(but expect a lot of output)
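If the output is too noisy, it can be narrowed down, for example to file-related system calls only:

strace -f -p 2770 -e trace=file

(the -e trace=file class limits tracing to calls that take a filename, such as open and stat)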

  • For some reason I'm getting `attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf` even though I'm running it as root? – BT643 Nov 19 '14 at 13:44