2

Occasionally one of our Linux MySQL database servers will hang for a while, creating a long queue of active queries. After 10-20 mins, service resumes as normally.

This happens in kern.log:

May 14 13:58:05 edu02 kernel: [2375521.716786] INFO: task jbd2/md2-8:1419 blocked for more than 120 seconds.
May 14 13:58:05 edu02 kernel: [2375521.716870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 14 13:58:05 edu02 kernel: [2375521.716953] jbd2/md2-8    D ffff88043b5a23f0     0  1419      2 0x00000000
May 14 13:58:05 edu02 kernel: [2375521.716958]  ffff88043b5a23f0 0000000000000046 00000000000000ff ffff88043b4e69c0
May 14 13:58:05 edu02 kernel: [2375521.716963]  00000000ffffffff ffffffff8118bac3 000000000000f9e0 ffff88043d099fd8
May 14 13:58:05 edu02 kernel: [2375521.716967]  00000000000157c0 00000000000157c0 ffff88043a411cc0 ffff88043a411fb8
May 14 13:58:05 edu02 kernel: [2375521.716971] Call Trace:
May 14 13:58:05 edu02 kernel: [2375521.716981]  [<ffffffff8118bac3>] ? generic_make_request+0x299/0x2f9
May 14 13:58:05 edu02 kernel: [2375521.716988]  [<ffffffff810168ec>] ? read_tsc+0xa/0x20
May 14 13:58:05 edu02 kernel: [2375521.716994]  [<ffffffff810757f6>] ? timekeeping_get_ns+0xe/0x2e
May 14 13:58:05 edu02 kernel: [2375521.716999]  [<ffffffff8111a5f2>] ? sync_buffer+0x0/0x40
May 14 13:58:05 edu02 kernel: [2375521.717004]  [<ffffffff8130b4a2>] ? io_schedule+0x73/0xb7
May 14 13:58:05 edu02 kernel: [2375521.717007]  [<ffffffff8111a62d>] ? sync_buffer+0x3b/0x40
May 14 13:58:05 edu02 kernel: [2375521.717010]  [<ffffffff8130b9af>] ? __wait_on_bit+0x41/0x70
May 14 13:58:05 edu02 kernel: [2375521.717014]  [<ffffffff8111a5f2>] ? sync_buffer+0x0/0x40
May 14 13:58:05 edu02 kernel: [2375521.717017]  [<ffffffff8130ba49>] ? out_of_line_wait_on_bit+0x6b/0x77
May 14 13:58:05 edu02 kernel: [2375521.717021]  [<ffffffff81066210>] ? wake_bit_function+0x0/0x23
May 14 13:58:05 edu02 kernel: [2375521.717035]  [<ffffffffa0138a22>] ? jbd2_journal_commit_transaction+0xbe5/0x115b [jbd2]
May 14 13:58:05 edu02 kernel: [2375521.717043]  [<ffffffffa013e8c1>] ? kjournald2+0xbe/0x206 [jbd2]
May 14 13:58:05 edu02 kernel: [2375521.717046]  [<ffffffff810661e2>] ? autoremove_wake_function+0x0/0x2e
May 14 13:58:05 edu02 kernel: [2375521.717052]  [<ffffffffa013e803>] ? kjournald2+0x0/0x206 [jbd2]
May 14 13:58:05 edu02 kernel: [2375521.717055]  [<ffffffff81065f15>] ? kthread+0x79/0x81
May 14 13:58:05 edu02 kernel: [2375521.717058]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
May 14 13:58:05 edu02 kernel: [2375521.717062]  [<ffffffff81065e9c>] ? kthread+0x0/0x81
May 14 13:58:05 edu02 kernel: [2375521.717064]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
May 14 13:58:05 edu02 kernel: [2375521.717080] INFO: task mysqld:5879 blocked for more than 120 seconds.
May 14 13:58:05 edu02 kernel: [2375521.717132] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 14 13:58:05 edu02 kernel: [2375521.717215] mysqld        D ffff88000facfb30     0  5879   2279 0x00000000
May 14 13:58:05 edu02 kernel: [2375521.717219]  ffff88043e46f300 0000000000000082 ffff88043c703d08 ffff88043c703d04
May 14 13:58:05 edu02 kernel: [2375521.717223]  0000000000000008 ffffffffa017942a 000000000000f9e0 ffff88043c703fd8
May 14 13:58:05 edu02 kernel: [2375521.717226]  00000000000157c0 00000000000157c0 ffff88043ad09cc0 ffff88043ad09fb8
May 14 13:58:05 edu02 kernel: [2375521.717230] Call Trace:
May 14 13:58:05 edu02 kernel: [2375521.717247]  [<ffffffffa017942a>] ? ext4_journal_start_sb+0xd4/0x10e [ext4]
May 14 13:58:05 edu02 kernel: [2375521.717256]  [<ffffffffa0179350>] ? __ext4_journal_stop+0x63/0x69 [ext4]
May 14 13:58:05 edu02 kernel: [2375521.717264]  [<ffffffffa01634de>] ? ext4_da_writepages+0x4e2/0x65c [ext4]
May 14 13:58:05 edu02 kernel: [2375521.717269]  [<ffffffff810be235>] ? sync_page+0x0/0x46
May 14 13:58:05 edu02 kernel: [2375521.717272]  [<ffffffff8130b4a2>] ? io_schedule+0x73/0xb7
May 14 13:58:05 edu02 kernel: [2375521.717275]  [<ffffffff810be276>] ? sync_page+0x41/0x46
May 14 13:58:05 edu02 kernel: [2375521.717278]  [<ffffffff8130b9af>] ? __wait_on_bit+0x41/0x70
May 14 13:58:05 edu02 kernel: [2375521.717282]  [<ffffffff810be3fa>] ? wait_on_page_bit+0x6b/0x71
May 14 13:58:05 edu02 kernel: [2375521.717285]  [<ffffffff81066210>] ? wake_bit_function+0x0/0x23
May 14 13:58:05 edu02 kernel: [2375521.717289]  [<ffffffff810c6222>] ? pagevec_lookup_tag+0x1a/0x21
May 14 13:58:05 edu02 kernel: [2375521.717293]  [<ffffffff810bebda>] ? wait_on_page_writeback_range+0x69/0x11b
May 14 13:58:05 edu02 kernel: [2375521.717297]  [<ffffffff810bed00>] ? __filemap_fdatawrite_range+0x4b/0x54
May 14 13:58:05 edu02 kernel: [2375521.717301]  [<ffffffff810bed4c>] ? filemap_write_and_wait_range+0x43/0x52
May 14 13:58:05 edu02 kernel: [2375521.717305]  [<ffffffff81118106>] ? vfs_fsync_range+0x55/0x9e
May 14 13:58:05 edu02 kernel: [2375521.717308]  [<ffffffff811181ce>] ? do_fsync+0x28/0x39
May 14 13:58:05 edu02 kernel: [2375521.717311]  [<ffffffff811181fd>] ? sys_fsync+0xb/0x10
May 14 13:58:05 edu02 kernel: [2375521.717315]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
May 14 14:04:05 edu02 kernel: [2375881.716569] INFO: task mysqld:20799 blocked for more than 120 seconds.
May 14 14:04:05 edu02 kernel: [2375881.716626] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 14 14:04:05 edu02 kernel: [2375881.716710] mysqld        D 0000000000000000     0 20799   2279 0x00000000
May 14 14:04:05 edu02 kernel: [2375881.716716]  ffff88043e46d640 0000000000000082 0000000000000000 ffff8800839be930
May 14 14:04:05 edu02 kernel: [2375881.716720]  ffff88043b4e4680 ffff88043b4e4680 000000000000f9e0 ffff88009592ffd8
May 14 14:04:05 edu02 kernel: [2375881.716724]  00000000000157c0 00000000000157c0 ffff880014059cc0 ffff880014059fb8
May 14 14:04:05 edu02 kernel: [2375881.716729] Call Trace:
May 14 14:04:05 edu02 kernel: [2375881.716740]  [<ffffffff81066103>] ? bit_waitqueue+0x10/0xa0
May 14 14:04:05 edu02 kernel: [2375881.716766]  [<ffffffffa01371ba>] ? do_get_write_access+0x22c/0x452 [jbd2]
May 14 14:04:05 edu02 kernel: [2375881.716771]  [<ffffffff81066210>] ? wake_bit_function+0x0/0x23
May 14 14:04:05 edu02 kernel: [2375881.716777]  [<ffffffffa0137402>] ? jbd2_journal_get_write_access+0x22/0x33 [jbd2]
May 14 14:04:05 edu02 kernel: [2375881.716796]  [<ffffffffa017fee6>] ? __ext4_journal_get_write_access+0x4e/0x56 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716804]  [<ffffffffa015fe4c>] ? ext4_reserve_inode_write+0x37/0x73 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716811]  [<ffffffffa015fec3>] ? ext4_mark_inode_dirty+0x3b/0x1c4 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716820]  [<ffffffffa017942a>] ? ext4_journal_start_sb+0xd4/0x10e [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716826]  [<ffffffff8130b4c2>] ? io_schedule+0x93/0xb7
May 14 14:04:05 edu02 kernel: [2375881.716833]  [<ffffffffa016016e>] ? ext4_dirty_inode+0x30/0x46 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716840]  [<ffffffff811148f1>] ? __mark_inode_dirty+0x25/0x14a
May 14 14:04:05 edu02 kernel: [2375881.716844]  [<ffffffff8110b958>] ? touch_atime+0x10a/0x133
May 14 14:04:05 edu02 kernel: [2375881.716850]  [<ffffffff810bf8f0>] ? generic_file_aio_read+0x499/0x536
May 14 14:04:05 edu02 kernel: [2375881.716855]  [<ffffffff810f9c45>] ? do_sync_read+0xce/0x113
May 14 14:04:05 edu02 kernel: [2375881.716858]  [<ffffffff810661e2>] ? autoremove_wake_function+0x0/0x2e
May 14 14:04:05 edu02 kernel: [2375881.716862]  [<ffffffff81105a05>] ? user_path_at+0x52/0x79
May 14 14:04:05 edu02 kernel: [2375881.716866]  [<ffffffff810fa668>] ? vfs_read+0xa6/0xff
May 14 14:04:05 edu02 kernel: [2375881.716869]  [<ffffffff810fa77d>] ? sys_read+0x45/0x6e
May 14 14:04:05 edu02 kernel: [2375881.716873]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b

The stack trace leads me to think it is file system/journaling or disk related, but I cannot be sure. There will usually be other errors in dmesg when a drive is failing, and the S.M.A.R.T. status is fine for both drives in the server.

What would cause such hangs?

sune
  • 61
  • 2
  • 6

2 Answers2

0

Well, it's a warning, a debugging feature. Your kernel has CONFIG_DETECT_HUNG_TASK feature enabled. As per CONFIG_DETECT_HUNG_TASK description (took from kernel config option help section http://lxr.linux.no/linux+*/lib/Kconfig.debug#L264):

 "When a hung task is detected, the kernel will print the  current stack trace
 (which you should report), but the task will stay in uninterruptible state."

And, this is the reason. You had few process stuck for more than 120 sec on uninterrruptible state. If you want to clean it up, just do the followings:

          "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
rakib_
  • 136,911
  • 4
  • 20
  • 26
  • I'm happy about being warned, but wondering why the process was stuck for 120 seconds or more. I think there may be an underlying issue. Perhaps the process call trace is not enough info for debugging it, though? – sune May 15 '13 at 17:42
  • Call trace is a well known method for program debugging and developers are using this technique very efficiently. But, for kernel fixing kernel bugs/problems developers usually expects how to reproduce the problem along with call trace, that helps to fix the problem with comfort. – rakib_ May 15 '13 at 17:50
0

Upgrading to a newer kernel (Debian Wheezy) magically solved this problem.

sune
  • 61
  • 2
  • 6