20

I'm currently running a few VMs and 'bare-metal' servers. Java CPU usage is high, over 400% at times. Randomly, the server hangs with errors on the console like "java ... blocked for more than 120 seconds", the same for kjournald, etc.

I cannot get dmesg output because, for some reason, this error only writes to the console, which I don't have access to since the machine is remotely hosted, so I cannot copy a full trace.

I've changed the environment this runs on, even moved to a physical server, and it's still happening.

I changed hung_task_timeout_secs to 0 in case this is a false positive, as per http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Technical_Notes/deployment.html .
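
For reference, the setting can be changed at runtime roughly like this (standard procfs/sysctl interface); it only silences the warning rather than fixing anything:

echo 0 > /proc/sys/kernel/hung_task_timeout_secs
# or, equivalently:
sysctl -w kernel.hung_task_timeout_secs=0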

Also, irqbalance is not installed; perhaps it would help?

This is Ubuntu 10.04 64-bit; the same issue occurs with the latest 2.6.38-15-server kernel and with 2.6.36.

Could CPU or memory issues, or running out of swap, cause this?

Here is the console message:

[58Z?Z1.5?Z840] INFO: task java:21547 blocked for more than 120 seconds.
[58Z?Z1.5?Z986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z841.5?Z06Z] INFO: task kjournald:190 blocked for more than 120 seconds.
[58Z841.5?Z336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z841.5?Z600] INFO: task flush-202:0:709 blocked for more than 120 seconds.
[58Z841.5?Z90?] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z841.5?3413] INFO: task java:21547 blocked for more than 120 seconds.
[58Z841.5?368Z] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z961.5?ZZ36] INFO: task kjournald:60 blocked for more than 120 seconds.
[58Z961.5?Z6Z5] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[58Z961.5?31ZZ] INFO: task flush-202:0:709 blocked for more than 120 seconds.
[58Z961.5?3393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Tee

4 Answers

18

Yes, it could.

What this means is fairly explicit: the kernel couldn't schedule the task for 120 seconds. This indicates resource starvation, often around disk access.

irqbalance might help, but it doesn't seem like the obvious fix. Can you provide us with the context surrounding this message in dmesg, in particular the stack trace that follows it?
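
If you can get a shell while the message is still in the ring buffer, something along these lines should recover the trace (the log paths assume a default Ubuntu/rsyslog setup):

dmesg | grep -A 30 'blocked for more than 120 seconds'
# or, if syslog captured the kernel messages:
grep -A 30 'blocked for more' /var/log/kern.log /var/log/syslog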

Moreover, this is not a false positive. This does not say that the task is hung forever, and the statement is perfectly correct. That doesn't mean it's a problem for you, and you can decide to ignore it if you don't notice any user impact.

This cannot be caused by:

  • a CPU issue (or rather, that would be an insanely improbable hardware failure),
  • a memory issue (a hardware failure would be very improbable and wouldn't happen repeatedly; it's not a lack of RAM either, as a process would have been oom-killed),
  • a lack of swap (oom-killer again).

To an extent, you might be able to blame this on a lack of memory, in the sense that depriving your system of data caching in RAM will cause more I/O. But it's not as straightforward as "running out of memory".
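
A quick way to rule those out (a sketch; `free` and `vmstat` are standard, and the syslog path assumes a default Ubuntu setup):

free -m                                   # RAM, buffers/cache and swap usage
vmstat 1 10                               # si/so (swap in/out) and wa (I/O wait) columns
grep -i 'killed process' /var/log/syslog  # any trace of the OOM killer?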

Pierre Carrier
  • There is nothing being recorded to /var/log/dmesg, so I just pasted what the console showed. When this appears, the system is 100% hung. – Tee Jul 05 '12 at 21:51
  • This message comes from the kernel, it will appear in `dmesg` (if it was logged recently enough) as this command prints the kernel logging ring buffer. Hopefully your `syslog` setup will also log it somewhere in `/var/log`, but I couldn't know where. – Pierre Carrier Jul 05 '12 at 22:19
  • The message will *NOT* appear in `/var/log/dmesg`, but *may* turn up when you run the `dmesg` command. The file is created during the boot process and generally only captures boot-time kernel messages (which would otherwise eventually scroll out of the kernel ring buffer). You could also install/enable `sysstat` and look at resource utilization as reported there (see the sar sketch after these comments). I'm suspecting disk I/O / iowait, likely related to swapping (sysstat will help in identifying this). – Dr. Edward Morbius Jul 06 '12 at 19:23
  • @Dr.EdwardMorbius So how do we fix this? I'm having a major issue related to this with our Zimbra server which was running great in a production environment until recently. – Lopsided Mar 21 '14 at 14:45
  • @Lopsided: Sorry for the delay, I'm not here often. Briefly: you'll have to profile your Java process and find out why it's hanging. Garbage collection is one area I've had issues (and successes) in tuning. Look up JVM garbage collection ergonomics and see http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html . I found that increasing the heap helped markedly. – Dr. Edward Morbius Apr 26 '14 at 10:40
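
A minimal sketch of the sysstat checks suggested in the comments above (assuming the package's cron collection is enabled; historical file locations vary by distribution):

sudo apt-get install sysstat            # then set ENABLED="true" in /etc/default/sysstat
sar -u 1 5                              # live CPU usage; watch the %iowait column
sar -d 1 5                              # per-device disk activity (await, %util)
sar -u -f /var/log/sysstat/saDD         # historical data; DD = day of month (Ubuntu path)
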
6
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5

Then commit the change with:

sudo sysctl -p

This solved it for me.
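
To keep these values across reboots, they can also be added to /etc/sysctl.conf (or a file under /etc/sysctl.d/), which is what `sysctl -p` reads back in, for example:

# /etc/sysctl.conf
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5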

Glutanimate
Nick
  • You should explain what each of those settings does. – kasperd Feb 21 '16 at 16:36
  • This fixed a similar issue I was having in a Docker environment. I found an explanation here: http://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/. "By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached the file system flushes all outstanding data to disk causing all following IOs going synchronous. For flushing out this data to disk this there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data withing..." – Peter M Feb 29 '16 at 16:35
  • `sysctl -p` does not commit the change; it causes the `/etc/sysctl*` files to be read and loaded into the kernel. `sysctl -w` is an instant commit; it doesn't need anything else. If you want these new values to be permanent across reboots, you have to edit one of the `/etc/sysctl*` files to add the change there. – Ian D. Allen May 16 '23 at 01:58
  • What worked for me was going right down to zero: `sysctl -w vm.dirty_ratio=0`. Anything greater than zero still caused large delays in disk-related commands such as `hdparm`, `sync`, etc. – Ian D. Allen May 16 '23 at 01:59
  • Unfortunately, an unacceptable side-effect of "sysctl -w vm.dirty_ratio=0" is that disk operations that move a lot of data are taking much too long. – Ian D. Allen May 16 '23 at 14:15
3

I recently went through this error in one of our Production clusters:

Nov 11 14:56:41 xxx kernel: INFO: task xfsalloc/3:2393 blocked for more than 120 seconds.

Nov 11 14:56:41 Xxxx kernel: Not tainted 2.6.32-504.8.1.el6.x86_64 #1

Nov 11 14:56:41 xxx: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

..

On further verification of the sar logs, I found that I/O wait had increased during the same time period.

Upon checking the hardware (the physical disks), I saw that medium errors and other SCSI errors had been logged on one of the physical disks, which in turn was blocking the I/Os due to a lack of resources to allocate.

11/11/15 19:52:40: terminatated pRdm 607b8000 flags=0 TimeOutC=0 RetryC=0 Request c1173100 Reply 60e06040 iocStatus 0048 retryC 0 devId:3 devFlags=f1482005 iocLogInfo:31140000

11/11/15 19:52:40: DM_ProcessDevWaitQueue: Task mgmt in process devId=x
11/11/15 19:52:40: DM_ProcessDevWaitQueue: Task mgmt in process devId=x

So in our cluster, this was due to a hardware error.

So it would be good if you could check for a core file, and if an IPMI utility is available, use the `ipmiutil`/`ipmitool sel elist` command to check for hardware issues.
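
For example, whichever of the two tools is installed (both read the same System Event Log):

ipmitool sel elist      # list SEL entries with decoded timestamps
ipmiutil sel            # roughly equivalent output from ipmiutil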

Regards, VT

0

You could go to the monitoring interface of your cloud provider and check whether you have exceeded the maximum IOPS specified for your storage; that would explain why it took a long time to flush the cached data.
The maximum IOPS figure is available on your storage's attributes page.
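
As a rough cross-check from inside the guest, the actual request rate can be measured with iostat from the sysstat package; r/s + w/s approximates IOPS:

iostat -x 1       # per-device r/s, w/s, await and %util, refreshed every second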

Emmanuel