
My server gets sudden high loads within seconds, and when I run dmesg the following messages appear:

INFO: task auditd:2185 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
auditd        D 0000012D  2216  2185      1          2216  2184 (NOTLB)
       f7fcaed0 00000086 eb3e3159 0000012d 0000012c 0000000e 00000000 00000007
       c35e3550 eb3e396f 0000012d 00000816 00000002 c35e365c c3420788 f7897200
       c36d0468 00000000 00000000 f7fcaecc c041f0c8 00000000 00000000 00000003
Call Trace:
 [<c041f0c8>] __wake_up+0x2a/0x3d
 [<c043753f>] prepare_to_wait+0x24/0x46
 [<f885f2f1>] log_wait_commit+0x80/0xc7 [jbd]
 [<c04373f3>] autoremove_wake_function+0x0/0x2d
 [<f885a680>] journal_stop+0x196/0x1bb [jbd]
 [<c04968da>] __writeback_single_inode+0x199/0x2a5
 [<c045d7b0>] do_writepages+0x2b/0x32
 [<c0459283>] __filemap_fdatawrite_range+0x66/0x72
 [<c0496f78>] sync_inode+0x19/0x24
 [<f8892019>] ext3_sync_file+0xb1/0xdc [ext3]
 [<c0479211>] do_fsync+0x41/0x83
 [<c0479270>] __do_fsync+0x1d/0x2b
 [<c0404ee1>] sysenter_past_esp+0x56/0x79

I'm really struggling to understand what is wrong. Following are the contents of /proc/meminfo:

# cat /proc/meminfo 
MemTotal:      4148160 kB
MemFree:        119352 kB
Buffers:         14024 kB
Cached:        3362784 kB
SwapCached:         84 kB
Active:        2608268 kB
Inactive:      1217900 kB
HighTotal:     3273304 kB
HighFree:         9092 kB
LowTotal:       874856 kB
LowFree:        110260 kB
SwapTotal:     4096552 kB
SwapFree:      4096456 kB
Dirty:           50816 kB
Writeback:      270996 kB
AnonPages:      449592 kB
Mapped:         889840 kB
Slab:           154948 kB
PageTables:      32796 kB
NFS_Unstable:        0 kB
Bounce:            624 kB
CommitLimit:   6170632 kB
Committed_AS:  2463988 kB
VmallocTotal:   116728 kB
VmallocUsed:      6728 kB
VmallocChunk:   109892 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

This is what the load looks like; it sometimes goes even higher than this:

Tasks: 506 total,   1 running, 504 sleeping,   0 stopped,   1 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 11.7%us,  1.9%sy,  0.0%ni, 86.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  1.0%us,  1.9%sy,  0.0%ni,  0.0%id, 96.1%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu3  :  7.8%us,  1.0%sy,  0.0%ni, 91.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4148160k total,  4033552k used,   114608k free,    13944k buffers
Swap:  4096552k total,       96k used,  4096456k free,  3361112k cached

Nothing seems to be wrong with the disks either:

# /usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/sda
# /usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/sdb
<nooutput>

I don't know how to debug this issue. If a process is causing it, how do I find which process? And if some kernel parameter needs tuning, I don't know which parameter I should tune.
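One way to catch the culprit is to list the tasks that are in uninterruptible sleep (state D, the same state that triggers the hung-task warning) while the load spike is happening. A minimal sketch, assuming procps `ps` is available:

```shell
# Show every task in uninterruptible sleep (state D) together with
# the kernel function it is waiting in (wchan). Tasks in D state
# count toward the load average even though they use no CPU.
ps -eo state,pid,user,wchan:32,comm | awk 'NR==1 || $1 ~ /^D/'

# If you have console access, this dumps all blocked tasks with
# full kernel stack traces into dmesg:
# echo w > /proc/sysrq-trigger
```

Running this repeatedly during a spike shows which processes are stuck waiting on I/O and where in the kernel they are blocked.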

  • Your process hung while writing to the disk. Either your disks are way too slow for the load you're putting on them, or one of them _is_ having a problem and you should test them more thoroughly. – Michael Hampton Jan 24 '13 at 23:59

1 Answer

Cpu2  :  1.0%us,  1.9%sy,  0.0%ni,  0.0%id, 96.1%wa,  0.0%hi,  1.0%si,  0.0%st

This is the most prominent performance problem I can see. Look at the wa field: 96.1% I/O wait means that CPU is spending almost all of its time waiting on I/O.

The stack trace shows that the task goes to commit metadata to the ext3 journal and then gets stuck waiting for the commit to complete. Perhaps some other process has already grabbed the lock and is not releasing it, resulting in a deadlock, while the blocked processes pile up in uninterruptible sleep and drive the load average higher.
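Given the large Writeback figure in the meminfo above (270996 kB), one hedged avenue worth exploring is the writeback thresholds: lowering them makes the kernel flush dirty pages earlier and in smaller bursts, instead of stalling fsync() callers behind one large flush. A sketch, not a prescription; the values below are illustrative only:

```shell
# Inspect the current writeback thresholds (percent of RAM).
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio

# Hypothetical tuning -- start background writeback earlier:
#   sysctl -w vm.dirty_background_ratio=5
#   sysctl -w vm.dirty_ratio=10
```

Whether this helps depends on the workload, so change the values only after confirming with iostat that writeback bursts line up with the stalls.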

What you need to do is collect fairly exhaustive data:

iostat -xdk 1 100

along with the usual sar data. Also post the I/O scheduler and the LUN queue depth of the hard disks.
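Both values live in sysfs; a minimal sketch, assuming the disks are named sda/sdb as in the question:

```shell
# Print the active I/O scheduler and LUN queue depth per disk.
for dev in /sys/block/sd*; do
  [ -d "$dev" ] || continue
  echo "== ${dev##*/} =="
  cat "$dev/queue/scheduler" 2>/dev/null    # active scheduler shown in [brackets]
  cat "$dev/device/queue_depth" 2>/dev/null # per-LUN queue depth
done
```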

Soham Chakraborty
  • The scheduler and queue depth values are: `# cat /sys/block/sd{a,b}/{device,queue}/{queue_depth,scheduler} 2>/dev/null` gives `1`, `noop anticipatory deadline [cfq]`, `1`, `noop anticipatory deadline [cfq]`. I'm waiting for it to happen again and will post iostat data as well; meanwhile, if you could guess the cause of the problem from the scheduler/queue depth, that would be helpful. Thanks – Imran Imtiaz Jan 25 '13 at 20:09
  • I don't know what kind of workload you are running. If it is read-heavy, you might want to try deadline. If it is a virtualized environment, try noop. These are rules of thumb and can change from scenario to scenario. – Soham Chakraborty Jan 25 '13 at 20:10
  • It's a physical machine; it runs imagemagick to convert images, so the workload is write-heavy. When the load goes high, at times I've seen all four CPU cores waiting for I/O and doing nothing, but I'm unable to find exactly what is causing those I/O blocks. Following is the output from iostat; however, the issue hasn't recurred yet, which is why the load looks normal at the moment: – Imran Imtiaz Jan 26 '13 at 00:13
  • I've hit the problem again. Here's how the machine looked when it happened: 1) top output: http://pastebin.com/7ph6B5Hp 2) iostat output: http://pastebin.com/T1gzLHk3 3) this time there was no output from dmesg. – Imran Imtiaz Jan 26 '13 at 00:41
  • and output from sar: http://pastebin.com/jnVSViq6 – Imran Imtiaz Jan 26 '13 at 00:48
  • Looking at the top output, I can see that some processes (incl. syslog) are stuck trying to read from or write to a device. Could you sort your top output by *state* (S column) and post the result? In addition, could you run `perf top -a`, wait 20 seconds, and see which functions keep appearing at the top of the list? (perf often needs to be installed.) – Huygens Feb 11 '13 at 17:08