
We have three servers running on the same ESX host; all of their virtual disks come from a remote SAN storage controller. These three servers hung and restarted several days ago, and today it happened to the DB server once more. The weird thing is that there is no panic log, crash log, or error log from when the problem occurred.


Server1. Web Server FreeBSD Meduna 8.1-RELEASE-p2 FreeBSD 8.1-RELEASE-p2 #2: Mon Feb 14 12:57:36 MYT 2011 hailang@Meduna:/usr/obj/usr/src/sys/Meduna amd64

Meduna# cat /var/log/messages | grep panic

Meduna# bzcat /var/log/messages.?.bz2 | grep panic

Meduna# cat /var/log/messages | grep error

Meduna# bzcat /var/log/messages.?.bz2 | grep error

May 28 16:05:04 Meduna kernel: /var: mount pending error: blocks 4 files 1


Server2. DB Server FreeBSD Moncalvo 8.1-RELEASE-p2 FreeBSD 8.1-RELEASE-p2 #1: Mon Jan 10 13:02:48 MYT 2011 hailang@Moncalve:/usr/obj/usr/src/sys/Moncalve amd64

Moncalvo# cat /var/log/messages | grep panic

Moncalvo# bzcat /var/log/messages.?.bz2 | grep panic

Moncalvo# cat /var/log/messages | grep error

Moncalvo# bzcat /var/log/messages.?.bz2 | grep error

May 28 16:17:17 Moncalvo kernel: /var: mount pending error: blocks -32 files 0


Server3. Not_In_Use FreeBSD Mecure 8.1-RELEASE-p2 FreeBSD 8.1-RELEASE-p2 #0: Fri Feb 11 14:45:55 MYT 2011 hailang@ServerX:/usr/obj/usr/src/sys/Mecure amd64

Mecure# cat /var/log/messages | grep panic

Mecure# bzcat /var/log/messages.?.bz2 | grep panic

Mecure# bzcat /var/log/messages.?.bz2 | grep error

Mecure# cat /var/log/messages | grep error

May 28 15:42:41 Mecure kernel: g_vfs_done():da0s1d[WRITE(offset=3275046912, length=16384)]error = 5

May 28 15:42:41 Mecure kernel: g_vfs_done():da0s1d[READ(offset=4062199808, length=16384)]error = 5

May 28 15:42:41 Mecure kernel: g_vfs_done():da0s1d[WRITE(offset=3281371136, length=10240)]error = 5


This is what /var/log/messages looks like when the problem occurs:


May 28 13:06:26 Meduna kernel: icmp redirect from 10.16.10.250: 113.23.142.94 => 10.16.10.18

May 28 13:07:01 Meduna kernel: icmp redirect from 10.16.10.250: 202.186.13.232 => 10.16.10.18

May 28 13:15:00 Meduna kernel: icmp redirect from 10.16.10.250: 113.23.142.94 => 10.16.10.18

May 28 13:15:35 Meduna kernel: icmp redirect from 10.16.10.250: 202.186.13.232 => 10.16.10.18

May 28 13:41:36 Meduna syslogd: kernel boot file is /boot/kernel/kernel

May 28 13:41:36 Meduna kernel: Copyright (c) 1992-2010 The FreeBSD Project.

May 28 13:41:36 Meduna kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994

[!] It just hung for about half an hour and then restarted without logging any error.

May 28 13:13:14 Moncalvo kernel: icmp redirect from 10.16.10.250: 60.49.152.98 => 10.16.10.18

May 28 13:14:25 Moncalvo kernel: icmp redirect from 10.16.10.250: 210.48.150.200 => 10.16.10.18

May 28 13:16:58 Moncalvo kernel: icmp redirect from 10.16.10.250: 183.78.169.57 => 10.16.10.18

May 28 15:59:06 Moncalvo syslogd: kernel boot file is /boot/kernel/kernel

May 28 15:59:06 Moncalvo kernel: Copyright (c) 1992-2010 The FreeBSD Project.

May 28 15:59:06 Moncalvo kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994

[!] And this server hung for more than 2 hours before it restarted.


I suspect that this might be a storage problem, but I have no proof of that. Could you please give me some advice on how to solve or dig into the issue? Any help is highly appreciated!

Best Regards,

Hai Lang

bestwc
  • I've seen the g_vfs_done() error on my FreeBSD boxes when the SAN network connection was broken briefly. – barryj Jun 01 '11 at 09:07
  • You are absolutely right that this is almost certainly a host-side problem, now that I've seen this thread: http://unix.derkeiler.com/Mailing-Lists/FreeBSD/questions/2009-12/msg00815.html – bestwc Jun 01 '11 at 09:43
  • Does anyone have an idea what to look for if I want to point the finger at the VMware guys and the SAN guys? – bestwc Jun 01 '11 at 09:43
  • @bestwc, in my case we had a Dell/EMC SAN as in the post you referenced. Can't say it was the same model, but we had an AX-45i, and saw that happen if we rebooted one switch or one storage processor - seems like the failover to the other switch/processor took longer than FreeBSD was happy with - never bothered any Windows VMs - only the FreeBSD ones - and then only one or two, possibly ones that had a moderate disk load at the time. Have since changed to an EqualLogic SAN and haven't seen the issue. – barryj Jun 01 '11 at 10:20
  • @barryj Thanks for sharing your experience. I also want to ask: is there any tuning or configuration on FreeBSD to make it more tolerant of the failover? Because the problem occurs exactly when the failover happens. – bestwc Jun 02 '11 at 05:56
  • @bestwc I've no ideas on the tuning, I didn't do anything about it, and we've moved to a different SAN now. I haven't seen the issue on the new SAN, though I've no idea if that's the reason as it's only been installed in the last couple of months. – barryj Jun 02 '11 at 06:41
  • It's almost certainly a SAN (or related) issue. Set up a syslog server and have the boxen ship messages over to it by adding `*.* @10.10.10.10` (change the IP addy as appropriate) to `/etc/syslog.conf`, then restart syslogd with `/etc/rc.d/syslogd restart`. You'll probably get some 'lost device' (or similar) entry right before the freeze-up. – Chris S Jun 28 '11 at 14:10
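For reference, a minimal sketch of that remote-logging setup; the collector address 10.10.10.10 is just the placeholder from the comment above, and the allowed subnet is an assumption to adjust for your network:

# On each FreeBSD guest: forward all messages to the remote collector,
# then restart syslogd so the change takes effect.
echo '*.*    @10.10.10.10' >> /etc/syslog.conf
/etc/rc.d/syslogd restart

# On the collector (assumed to be FreeBSD as well): allow the guests'
# subnet to log remotely by overriding the default "-s" (no remote
# logging) flags, then restart syslogd there too.
# /etc/rc.conf:
#   syslogd_flags="-a 10.16.10.0/24"
/etc/rc.d/syslogd restart

That way the last messages before a hang end up on a box that is not itself losing its disk.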

1 Answer


The problem was most probably caused by a SAN malfunction. When FreeBSD loses its disk, there is almost no way for it to leave a panic log entry. But in a VM environment (and also on a very few motherboards) the kernel message buffer (msgbuf, what dmesg prints) can survive the reboot. You may try to examine it.
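A rough sketch of what to check, assuming the stock FreeBSD paths and rc.conf knobs (adjust for your layout):

# Right after the unexpected reboot, see whether the previous kernel
# message buffer survived in RAM -- on some VMs and motherboards it does.
dmesg -a | less

# /var/run/dmesg.boot only contains the messages from the current boot,
# so compare it against the full buffer shown above.

# To make a future panic leave solid evidence, enable crash dumps
# (this assumes a swap device large enough to hold the dump):
# /etc/rc.conf:
#   dumpdev="AUTO"
#   dumpdir="/var/crash"
# savecore(8) runs at boot and writes any dump it finds into dumpdir.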

For debugging, you can try dropping into DDB on panic instead of rebooting.
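A sketch of what that takes, assuming you rebuild the custom kernels named in the uname lines above (Meduna, Moncalve, Mecure):

# Kernel config additions, if not already present; rebuild and
# install the kernel afterwards.
options KDB     # kernel debugger framework
options DDB     # interactive in-kernel debugger
# Do not add KDB_UNATTENDED, or a panic will reboot the box instead
# of dropping into the debugger.

# At runtime, make sure a panic stops in DDB rather than rebooting:
sysctl debug.debugger_on_panic=1

Note that DDB only helps if the machine actually panics; for a hard hang you would still need to break into the debugger from the console.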

PS. If you have a systems programmer at hand, you can ask them to write something like Linux's netconsole for FreeBSD.

SaveTheRbtz