Simultaneous server crash in failover cluster

Question

I have 2 servers in a failover cluster. The cluster defines a shared 'ClusterStorage' drive. The drive maps to a SAN device through iSCSI.

Recently, the 2 servers rebooted on their own at the same time. The errors in the servers' event logs and in the cluster event log indicated that the servers could not access or write to the shared drive. Each server has access to the SAN through 2 separate network paths on 2 different subnets using 2 network cards. The SAN has 2 controllers. The event log on the SAN does not report any error corresponding to this event. Additionally, the database server, which also uses the SAN (through a SQL role defined on the cluster and a dedicated drive), did not report any I/O errors.

This seems to indicate that the SAN was fine and reachable. Yet, the 2 servers rebooted on their own, defeating the point of having redundancy through a cluster.

Cluster events -- MAPLE rebooted

Administrative event log on MAPLE

System event log on MAPLE

Any idea on the actual cause for this reboot?

`Any idea on the actual cause for this reboot?` - Um, no. Why would we have any idea? We don't have access to the servers. We can only go by what you've told us, which isn't exactly the most detailed description. How about posting the details of the event log entries you referred to in your question? Also, when you say `The cluster defines a shared 'ClusterStorage' drive` do you mean that you're using a Cluster Shared Volume? — joeqwerty, Jul 31 '15 at 14:10
@joeqwerty 'do you mean that you're using a Cluster Shared Volume' : yes; Relevant events on the cluster and one of the servers added. — HashPsi, Jul 31 '15 at 14:47
OK, it's doubtful that was the cause of the reboot. Have you looked at the System event log on the servers? — joeqwerty, Jul 31 '15 at 14:49
@joeqwerty Yes. It has essentially the same entries. See added screen shot. — HashPsi, Jul 31 '15 at 14:53
Again, there's no indication that the error is the cause of the reboot and it's doubtful that it is. I simulated a failure of my iSCSI storage array in my FC and didn't see any reboots of the hosts. There's another reason the hosts rebooted. — joeqwerty, Jul 31 '15 at 17:36
@joeqwerty Thanks for running this test. There are no other errors around that time that I can find so I am not sure how to investigate further. This setup has worked for 2+ years without any problem. I am just concerned that it might happen again. This kind of downtime is not exactly to the liking of the users. — HashPsi, Jul 31 '15 at 17:47
You're focusing on errors but maybe the reboots were user initiated or initiated by Windows Updates. I might suggest planning an after hours simulated failure of your storage to see if it results in a reboot in order to rule out that error as evidence of the cause of the reboot. — joeqwerty, Jul 31 '15 at 17:59
@joeqwerty Thanks for the suggestion. I have ruled out Windows Update as the cause. Automatic install of updates is not enabled on either server. Only one other person has the admin privileges to reboot the servers. The servers are in a locked cabinet in a data center. I will try your suggestion to simulate a failure to see if I can reproduce this condition. — HashPsi, Jul 31 '15 at 18:19
I would simulate the failure using multiple methods (assuming you have the necessary "protection" mechanisms and backups in place): FC simulated failure, pull the network connections, pull power to the storage array, etc. For safety's sake, shut down all of your VMs before testing. — joeqwerty, Jul 31 '15 at 19:09
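
One way to follow up on the "user initiated or Windows Updates" question raised in the comments is to look specifically for the shutdown-related entries in each server's System log around the time of the reboot, rather than only for storage errors. The sketch below is only an illustration: it assumes the log entries have already been exported into (timestamp, event ID) pairs, and the function name and input shape are hypothetical, not part of any Windows tool. The event IDs themselves are the standard ones: 1074 (a user or process requested the restart), 6006 (clean shutdown), 6008 (previous shutdown was unexpected) and 41 (Kernel-Power, rebooted without a clean shutdown).

```python
from datetime import datetime

# Well-known Windows System-log event IDs related to shutdowns and restarts.
REBOOT_EVENT_IDS = {
    1074: "planned: a user or process requested the shutdown/restart",
    6006: "clean: the Event Log service was stopped (normal shutdown)",
    6008: "unexpected: the previous shutdown was not clean",
    41: "unexpected: Kernel-Power, rebooted without a clean shutdown",
}


def classify_reboot_events(events, start, end):
    """Return (time, event_id, meaning) for shutdown-related events in [start, end].

    `events` is an iterable of (datetime, int) pairs. That shape is an
    assumption of this sketch, not the format of any real log export.
    """
    hits = []
    for when, event_id in events:
        # Keep only entries inside the window that match a known reboot ID.
        if start <= when <= end and event_id in REBOOT_EVENT_IDS:
            hits.append((when, event_id, REBOOT_EVENT_IDS[event_id]))
    # Sort chronologically so the sequence of events is easy to read.
    return sorted(hits)
```

If the window around the reboot contains a 1074 entry, the restart was requested by a user or a process, and the entry names it. If it contains only 41 and 6008, the servers went down without a clean shutdown, which points away from a user or an update and toward something forcing the hosts down.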
