server hangs randomly without log entries

Question

My physical server hangs at random, and nothing is logged while it hangs. To measure the hangs, I started this loop:

while true; do date >>/tmp/dates; sleep 1; done & disown

After a few minutes the file /tmp/dates has "holes". Sometimes the timestamps stop for 30 seconds, sometimes for 3 minutes at a stretch:

Thu Feb 13 14:54:39 CET 2014
Thu Feb 13 14:54:40 CET 2014
Thu Feb 13 14:57:45 CET 2014
Thu Feb 13 14:57:46 CET 2014

The server does not show high load or memory utilization. /var/log/messages shows nothing for the times when the server hangs. It does, however, contain some messages about I/O problems, such as a Fibre Channel link dropping out for a while. I am booting from SAN, so / is on a device-mapper multipath device. If I run both of these loops:

while true; do date >>/tmp/dates; sleep 1; done & disown
while true; do date >>/dev/shm/dates; sleep 1; done & disown

Then /dev/shm/dates has no holes, but /tmp/dates does. Since /dev/shm lives in memory and /tmp is on the SAN, I think this is an I/O hang rather than a kernel hang.
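
To put numbers on the holes instead of reading them off by eye, something like the following could be used. It is only a sketch and assumes the loops are changed to log `date +%s` (epoch seconds, one per line) instead of the default `date` format. `find_gaps` and its 2-second default threshold are made-up names and values, not part of any existing tool.

```shell
# Hypothetical helper: report every gap longer than THRESHOLD seconds
# in a file of epoch timestamps (one per line, as written by `date +%s`).
find_gaps() {
    file=$1
    threshold=${2:-2}   # assumed default: anything above 2 s counts as a hole
    awk -v max="$threshold" '
        # Compare each timestamp with the previous one.
        NR > 1 && $1 - prev > max {
            printf "gap of %d s between %d and %d\n", $1 - prev, prev, $1
        }
        { prev = $1 }
    ' "$file"
}
```

Running `find_gaps /tmp/dates` and `find_gaps /dev/shm/dates` side by side would then list the stalls that only the SAN-backed file shows. The hole in the sample above, from 14:54:40 to 14:57:45, would come out as a gap of 185 s.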

I see no processes in uninterruptible sleep (state D) in top, and no entries in the hardware or storage logs that seem relevant.

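However, iostat shows a large spike in await (over 5 seconds at 12:32:03), preceded by several seconds in which requests sit in the queue at 100% utilization but none complete: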

03/28/14 12:32:00
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00   214.00    0.00   0.00 100.00

03/28/14 12:32:01
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00   214.00    0.00   0.00 100.00

03/28/14 12:32:02
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00   214.00    0.00   0.00 100.00

03/28/14 12:32:03
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00 1219.00     0.00     4.76     8.00    45.61 5251.48   0.82 100.40

03/28/14 12:32:04
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.99    0.00   0.00  99.60

03/28/14 12:32:05
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     3.01    0.00   0.00 100.40

03/28/14 12:32:06
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     3.00    0.00   0.00 100.00

Could it be that the kernel is waiting for IO from the storage and does not do anything until the block arrives? What else could it be? How can I find out?

Do you have steal time in top? Which kernel version do you use? Do you use a plain Linux kernel or something like a Xen-modified kernel? — David Lakatos, Feb 13 '14 at 22:51
Real, non-virtualized host. I had the same problem in a VM, so I decided to do a bare-metal installation. — Thorsten Staerk, Feb 14 '14 at 06:13
Linux vcehanar15 3.0.101-0.7.15-default #1 SMP Wed Jan 15 22:13:35 UTC 2014 (0e621ee) x86_64 x86_64 x86_64 GNU/Linux — Thorsten Staerk, Feb 14 '14 at 09:29
I don't understand why the answer to this isn't install monitoring and gather evidence. Everything else is just guessing. — user9517, Feb 14 '14 at 11:51
Echo not just the time but also iostat, top, vmstat and a few others to shm, and see what is happening during the hangs. — Stone, Feb 23 '14 at 09:34
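
A minimal sketch of that last suggestion, assuming /dev/shm has enough room for the logs: each output line of a monitoring command is prefixed with the epoch second at which it was read and appended to a file under /dev/shm, so the logging itself does not write to the SAN. `log_stamped` and the `hangdiag` directory are hypothetical names.

```shell
# Assumed log location in memory-backed storage; override LOG_DIR if needed.
LOG_DIR=${LOG_DIR:-/dev/shm/hangdiag}

# Hypothetical helper: run a command and append its output to
# $LOG_DIR/<name>.log, prefixing every line with the current epoch second.
log_stamped() {
    name=$1
    shift
    mkdir -p "$LOG_DIR"
    "$@" 2>&1 | while IFS= read -r line; do
        printf '%s %s\n' "$(date +%s)" "$line"
    done >> "$LOG_DIR/$name.log"
}
```

For example, `log_stamped vmstat vmstat 1 &` and `log_stamped iostat iostat -x 1 &` could be left running in the background. Some tools buffer their output when writing to a pipe, so the stamps can lag behind the real time of a sample unless the tool flushes after every line.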

score 0 · Accepted Answer · answered Mar 31 '14 at 09:33

Looking at the FC switch's error log, we found one port with a huge number of errors. We disabled this port and have not seen the hang for a whole weekend now, so I think it was caused by a broken Fibre Channel path. As far as I know, such a fault should normally be detected (e.g. by multipathd), logged and auto-corrected. So I assume this is one of those cases where a Fibre Channel path fault goes undetected.
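
On the host side, the Linux FC transport class exposes per-HBA error counters under /sys/class/fc_host/host*/statistics/. A rough sketch for listing the non-zero ones follows. Whether these counters would have revealed the bad switch port in this case is an assumption, `fc_error_counters` is a hypothetical helper name, and drivers differ in which counters they fill in.

```shell
# Hypothetical helper: print "<host> <counter> <value>" for every non-zero
# Fibre Channel error counter found under BASE (default: /sys/class/fc_host).
fc_error_counters() {
    base=${1:-/sys/class/fc_host}
    for host in "$base"/host*; do
        [ -d "$host/statistics" ] || continue
        for counter in link_failure_count loss_of_sync_count \
                       loss_of_signal_count invalid_crc_count \
                       invalid_tx_word_count error_frames; do
            f=$host/statistics/$counter
            [ -r "$f" ] || continue
            value=$(cat "$f")
            case $value in
                # Skip zero counters; all-ones is assumed here to mean
                # "not implemented by this driver".
                0x0|0xffffffffffffffff) continue ;;
            esac
            printf '%s %s %s\n' "${host##*/}" "$counter" "$value"
        done
    done
}
```

Comparing the output of `fc_error_counters` before and after a hang would show which counters moved, and `multipath -ll` shows the path states that multipathd currently sees.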
