Analyzing a loaded lighttpd server serving from NFS

Question

Context:
- The server is a CentOS 5.2 x86_64 virtual machine with vmxnet3 interfaces, running on vSphere 4.1 on a Nehalem-based host (which is at half CPU and memory capacity according to vCenter) with a 10 Gb network. iostat shows almost no I/O on the VM's virtual SCSI disk.
- Reads videos from an Isilon cluster, using NFS (atime is disabled)
- Serves them using lighttpd 1.5.0, which sits at 20% cpu. Around 650 HTTP connections, including 550 established, with an average of 100 Kb in Send-Q.

As we load the server with more requests, CPU wait and irq increase. Memory isn't the problem.

Cpu0  :  0.0%us,  3.0%sy,  0.0%ni, 18.0%id,  0.0%wa, 32.0%hi, 47.0%si,  0.0%st
Cpu1  :  3.0%us,  4.0%sy,  0.0%ni, 55.4%id, 34.7%wa,  0.0%hi,  3.0%si,  0.0%st

According to /proc/interrupts, there are 4163 irq/s on the interface used for HTTP and 2269 irq/s on the one used for NFS, for 180 Mbps and 130 Mbps respectively according to iptraf.
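For anyone wanting to reproduce the irq/s figures, here is a minimal Python sketch that derives a rate from two snapshots of /proc/interrupts. The function names and the device-name matching (a plain substring match on the description column) are assumptions for illustration, not something taken from this setup.

```python
import time


def irq_counts(text):
    """Parse /proc/interrupts text into {irq_label: (total_count, description)}."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row lists CPU0, CPU1, ...
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue
        label = fields[0].rstrip(":")
        nums = []
        # Some rows (e.g. ERR) have fewer numeric columns than CPUs.
        for field in fields[1:1 + ncpu]:
            if not field.isdigit():
                break
            nums.append(int(field))
        desc = " ".join(fields[1 + len(nums):])
        counts[label] = (sum(nums), desc)
    return counts


def irq_rate(before, after, device, seconds):
    """Interrupts per second for rows whose description mentions `device`."""
    old = irq_counts(before)
    new = irq_counts(after)
    delta = 0
    for label, (count, desc) in new.items():
        if device in desc:
            delta += count - old.get(label, (0, ""))[0]
    return delta / seconds


def sample_irq_rate(device, seconds=1.0, path="/proc/interrupts"):
    """Take two snapshots `seconds` apart and return the rate (Linux only)."""
    with open(path) as f:
        before = f.read()
    time.sleep(seconds)
    with open(path) as f:
        after = f.read()
    return irq_rate(before, after, device, seconds)
```

Calling `sample_irq_rate("eth0", 5.0)` would average over five seconds; the interface name is a placeholder.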

iostat for the NFS mount:

rBlk_nor/s   wBlk_nor/s   rBlk_dir/s   wBlk_dir/s   rBlk_svr/s   wBlk_svr/s    rops/s    wops/s
63737.87         0.00         0.00         0.00     61364.71         0.00   1098.04   1107.84

Hey, wops? But /proc/self/mountstats shows no SETATTR or similar calls:

    opts:   ro,vers=3,rsize=32768,wsize=32768,acregmin=1200,acregmax=1200,acdirmin=1200,acdirmax=1200,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys
    age:    2405948
    caps:   caps=0x1,wtmult=8192,dtsize=4096,bsize=0,namelen=255
    sec:    flavor=1,pseudoflavor=1
    events: 3496282 32148506 1 1697 3176945 2598729 37924190 0 33339443 67286271 0 0 20 0 0 0 0 3176406 0 0 0 0 0 0 0
    bytes:  31773968205376 0 0 0 31969360034250 0 7805430344 0
    RPC iostats version: 1.0  p/v: 100003/3 (nfs)
    xprt:   tcp 779 0 50 250 0 1014646219 1014646203 0 8377876491 11916594888
    per-op statistics
            NULL: 0 0 0 0 0 0 0 0
         GETATTR: 3496282 3496282 0 461510280 391583584 2165765 2594488 5332330
         SETATTR: 0 0 0 0 0 0 0 0
          LOOKUP: 2598882 2598882 0 374792176 623714816 3558569 79355750 83640121
          ACCESS: 2824036 2824036 0 384066672 338884320 1788232 2276978 4482334
        READLINK: 0 0 0 0 0 0 0 0
            READ: 1005726981 1005726982 0 144824685416 32098094238420 7454826308 4671373832 13644100410
           WRITE: 0 0 0 0 0 0 0 0
          CREATE: 0 0 0 0 0 0 0 0
           MKDIR: 0 0 0 0 0 0 0 0
         SYMLINK: 0 0 0 0 0 0 0 0
           MKNOD: 0 0 0 0 0 0 0 0
          REMOVE: 0 0 0 0 0 0 0 0
           RMDIR: 0 0 0 0 0 0 0 0
          RENAME: 0 0 0 0 0 0 0 0
            LINK: 0 0 0 0 0 0 0 0
         READDIR: 0 0 0 0 0 0 0 0
     READDIRPLUS: 13 13 0 2132 23788 60 1240 1300
          FSSTAT: 2 2 0 256 336 0 0 0
          FSINFO: 1 1 0 128 164 0 10 10
        PATHCONF: 0 0 0 0 0 0 0 0
          COMMIT: 0 0 0 0 0 0 0 0
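One way to read the per-op counters above is to turn the cumulative totals into per-call averages. The sketch below assumes the usual layout of "RPC iostats version 1.0" lines (operations, transmissions, timeouts, bytes sent, bytes received, then cumulative queue, RTT and execute times in milliseconds); the field names and function names are labels chosen here for illustration.

```python
FIELDS = (
    "ops", "transmissions", "timeouts", "bytes_sent",
    "bytes_received", "queue_ms", "rtt_ms", "execute_ms",
)


def parse_per_op(text):
    """Return {op_name: {field: value}} from a mountstats dump."""
    stats = {}
    in_per_op = False
    for line in text.splitlines():
        if line.strip() == "per-op statistics":
            in_per_op = True  # ignore events:/bytes: lines before this marker
            continue
        if not in_per_op:
            continue
        name, sep, rest = line.partition(":")
        values = rest.split()
        if sep and len(values) == len(FIELDS) and all(v.isdigit() for v in values):
            stats[name.strip()] = dict(zip(FIELDS, map(int, values)))
    return stats


def per_call_averages(op):
    """Average queue, RTT and execute time (ms) per call, or None if unused."""
    n = op["ops"]
    if n == 0:
        return None
    return {
        "queue_ms": op["queue_ms"] / n,      # waiting for a transport slot
        "rtt_ms": op["rtt_ms"] / n,          # request sent until reply received
        "execute_ms": op["execute_ms"] / n,  # total time including queueing
    }
```

If the RTT average for READ is small while the execute average is much larger, most of the delay is on the client side (queueing) rather than on the filer; a large RTT points at the network or the NFS server instead.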

How can I tell whether the HTTP side or the NFS side is causing the iowait and irq CPU usage? And how can I tell whether the vSphere host is reaching its I/O limits?

score 1 · Accepted Answer · answered Sep 15 '10 at 21:16

nodiratime might also help, but I thought that would be reported under SETATTR as well.

cpu0 has very high hardware/software interrupts, and cpu1 has reasonably high wait times for 200mb/sec, even considering your dataset is mostly cold. If these are long videos, your rsize may be causing excessive fragmentation if the MTU of eth1 is small. I'm almost thinking your rsize is too high. You might run some tests from another workstation using dd and various mount options to see whether the 32768 setting is negatively impacting things.
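To get a feel for how rsize and MTU interact, here is a rough back-of-the-envelope sketch in Python. It assumes IPv4 and TCP headers without options (40 bytes total) and ignores RPC/NFS reply headers; with proto=tcp the split happens as TCP segmentation rather than IP fragmentation, but the packet count per READ is the same order of magnitude either way. The function name is made up for this example.

```python
import math

IPV4_TCP_HEADER_BYTES = 40  # assumption: 20-byte IPv4 + 20-byte TCP, no options


def segments_per_read(rsize, mtu):
    """Approximate number of packets needed to carry one READ reply payload."""
    payload_per_packet = mtu - IPV4_TCP_HEADER_BYTES
    if payload_per_packet <= 0:
        raise ValueError("MTU too small to carry any payload")
    return math.ceil(rsize / payload_per_packet)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        for rsize in (8192, 32768):
            print(f"mtu={mtu} rsize={rsize}: "
                  f"{segments_per_read(rsize, mtu)} packets per READ")
```

The packet count only hints at interrupt load, since interrupt coalescing and offloads on the NIC change how many interrupts those packets actually generate, so measuring with different mount options is still the real test.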

Open a ticket with Isilon; their tech guys are quite good at debugging this as well.

karmawhore

Thanks for the idea about rsize and fragmentation, will try it. – David142 Sep 24 '10 at 14:10
noatime is a superset of nodiratime: http://lwn.net/Articles/245002/ – David142 Sep 28 '10 at 12:09
