2

My physical server hangs randomly and I do not get any log entries during that time. So I let a script run:

while true; do date >>/tmp/dates; sleep 1; done & disown

After some minutes the file /tmp/dates has "holes". Sometimes the dates are interrupted for 30 seconds, sometimes for 3 minutes in a row:

Thu Feb 13 14:54:39 CET 2014
Thu Feb 13 14:54:40 CET 2014
Thu Feb 13 14:57:45 CET 2014
Thu Feb 13 14:57:46 CET 2014

The server does not show high load or memory utilization. /var/log/messages does not show anything for the times when the server hangs. However it does have some messages about IO problems like a fibrechannel link taking a nap. I am booting from SAN, so / is on a multimapper device. If I call

while true; do date >>/tmp/dates; sleep 1; done & disown
while true; do date >>/dev/shm/dates; sleep 1; done & disown

There will be no holes in /dev/shm/dates, but there will be holes in /tmp/dates so I think it is not a kernel hang situation, but an IO hang situation.

I can see no processes in uninterruptible sleep state in top. No entries in the hardware or storage log that seem relevant.

However iostat shows high latency spikes regarding await:

03/28/14 12:32:00
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00   214.00    0.00   0.00 100.00

03/28/14 12:32:01
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00   214.00    0.00   0.00 100.00

03/28/14 12:32:02
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00   214.00    0.00   0.00 100.00

03/28/14 12:32:03
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00 1219.00     0.00     4.76     8.00    45.61 5251.48   0.82 100.40

03/28/14 12:32:04
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.99    0.00   0.00  99.60

03/28/14 12:32:05
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     3.01    0.00   0.00 100.40

03/28/14 12:32:06
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     3.00    0.00   0.00 100.00

Could it be that the kernel is waiting for IO from the storage and does not do anything until the block arrives? What else could it be? How can I find out?

Thorsten Staerk
  • 387
  • 2
  • 11

1 Answers1

0

Looking at the FC-switch's error log we found one port with a huge error log. We disabled this port and have not seen the error for a whole weekend now. So I think this error has been caused by a broken fibrechannel path. Usually, to my information, such an error should be detected (e.g. by multipathd), logged and auto-corrected. So I assume we have one of the cases of an undetected fibrechannel path fault here.

Thorsten Staerk
  • 387
  • 2
  • 11