
Recently I have been getting Zabbix alerts about our mail system being unavailable, even though the uptime on the machine is 30+ days. I've been tracing the Zabbix logs, and it looks like the Zabbix agent failed to respond to the server in time, which triggered the alert.

To find out whether it was a network issue or something else, I viewed /var/log/messages and found the following entries:

Nov 14 21:48:49 iw kernel: INFO: task zabbix_agentd:3316 blocked for more than 120 seconds.
Nov 14 21:48:49 iw kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:48:49 iw kernel: zabbix_agentd D 0000000000000003     0  3316   3311 0x00000080
Nov 14 21:48:49 iw kernel: ffff880069075c50 0000000000000086 ffffffff817a8d69 ffff880069075c68
Nov 14 21:48:49 iw kernel: ffff880486ea3000 ffff880069075c58 ffffffff8127cb66 0000000000000009
Nov 14 21:48:49 iw kernel: ffff88042085bab8 ffff880069075fd8 000000000000fb88 ffff88042085bab8
Nov 14 21:48:49 iw kernel: Call Trace:
Nov 14 21:48:49 iw kernel: [<ffffffff8127cb66>] ? vsnprintf+0x2b6/0x5f0
Nov 14 21:48:49 iw kernel: [<ffffffff814ffec5>] rwsem_down_failed_common+0x95/0x1d0
Nov 14 21:48:49 iw kernel: [<ffffffff81500056>] rwsem_down_read_failed+0x26/0x30
Nov 14 21:48:49 iw kernel: [<ffffffff8127e664>] call_rwsem_down_read_failed+0x14/0x30
Nov 14 21:48:49 iw kernel: [<ffffffff814ff554>] ? down_read+0x24/0x30
Nov 14 21:48:49 iw kernel: [<ffffffff81140511>] __access_remote_vm+0x41/0x1f0
Nov 14 21:48:49 iw kernel: [<ffffffff81144052>] ? vma_merge+0x1d2/0x3e0
Nov 14 21:48:49 iw kernel: [<ffffffff8114071b>] access_process_vm+0x5b/0x80
Nov 14 21:48:49 iw kernel: [<ffffffff811e295d>] proc_pid_cmdline+0x6d/0x120
Nov 14 21:48:49 iw kernel: [<ffffffff8115c30a>] ? alloc_pages_current+0xaa/0x110
Nov 14 21:48:49 iw kernel: [<ffffffff811e357d>] proc_info_read+0xad/0xf0
Nov 14 21:48:49 iw kernel: [<ffffffff8117b9e5>] vfs_read+0xb5/0x1a0
Nov 14 21:48:49 iw kernel: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
Nov 14 21:48:49 iw kernel: [<ffffffff8117bb21>] sys_read+0x51/0x90
Nov 14 21:48:49 iw kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Kernel info:

Linux mail 2.6.32-279.2.1.el6.x86_64 #1 SMP Fri Jul 20 01:55:29 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Memory info (from free -m, values in MB):

             total       used       free     shared    buffers     cached
Mem:         24031      21497       2533          0        606      14562
-/+ buffers/cache:       6328      17702
Swap:        31999         49      31950

I'm looking for some guidance on where to begin narrowing down the root cause of these issues.
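
For reference, the relevant entries can be pulled out of /var/log/messages with something like the following; the grep pattern simply matches the message text shown above, and the sysctl file is the one the kernel itself mentions:

# List every hung-task report plus the two lines that follow it
grep -A 2 "blocked for more than 120 seconds" /var/log/messages

# Show the current hung-task timeout (120 seconds by default, per the message above)
cat /proc/sys/kernel/hung_task_timeout_secs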

bmurtagh
  • Physical or virtual? – ewwhite Nov 15 '12 at 18:14
  • Sorry, physical server – bmurtagh Nov 16 '12 at 14:18
  • Is this reproducible? E.g. can you force it to happen? What's the system load at the time you see the messages? – ewwhite Nov 16 '12 at 14:22
  • This is also happening in Ubuntu 11.04, where it blocks vmware-vmx during heavy IO periods. – boatcoder Jan 30 '13 at 15:39
  • From reading some related threads, people started hitting this issue after upgrading to RHEL 6.3 (2.6.32-279), initially with processes writing to an NFS device and later with all processes. The first thing I would try is downgrading the kernel; if that is not an option, try disabling transparent huge pages, which resolves the issue in some cases: echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled (see the sketch after these comments). Please test on a test machine before implementing it in the production environment, as I am really not sure it will fix your issue. – Prashant Lakhera Aug 27 '14 at 09:23
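
A minimal sketch of that THP workaround, assuming the RHEL 6 sysfs path (upstream kernels expose /sys/kernel/mm/transparent_hugepage/enabled instead); as the commenter notes, test it outside production first:

# Disable transparent huge pages at runtime
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

# Verify -- the active value is shown in brackets, e.g. "always madvise [never]"
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled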

1 Answer


I found this post; not sure whether it applies to you or not: http://blog.ronnyegner-consulting.de/2011/10/13/info-task-blocked-for-more-than-120-seconds/

How much CPU do you have? It looks like you have quite a bit of memory (24 GB). If the blog post is correct, then your system may not be able to flush dirty memory from the cache fast enough to keep up with the I/O you have coming in.

You can set "vm.dirty_ratio = 10" in /etc/sysctl.conf to force the kernel to flush dirty pages sooner. This may help with your issue.
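
A minimal sketch of that change, assuming you want it both persisted and applied without a reboot (the value 10 is just the suggestion above; tune as needed):

# Persist the setting across reboots
echo "vm.dirty_ratio = 10" >> /etc/sysctl.conf

# Apply it to the running kernel immediately
sysctl -w vm.dirty_ratio=10

# Confirm the active value
sysctl vm.dirty_ratio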

DMon