I am running Ubuntu 12.04 LTS. Yesterday I found a message in my mailbox saying that my server had been shut down. I rebooted the system, but it still hadn't come up after many minutes, and I don't have a hardware KVM to see what the kernel was printing to the console. So I booted into a Linux rescue image instead and saw that the software RAID 1 array was out of sync. The rescue system also began to reconstruct the RAID array.
So far there is no evidence that any of the disks have hardware errors. SMART statuses look good so far.
I never received an email notification from mdadm, even though email notification was turned on in /etc/mdadm/mdadm.conf.
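To rule out a plain configuration mistake, the setting can be checked mechanically. The following is only a minimal sketch, not an mdadm tool: it assumes that notification is configured through a `MAILADDR` line in mdadm.conf, and the helper name `find_mailaddr` and the sample text are hypothetical. It ignores continuation lines and keyword abbreviations, and it only inspects the file, so it says nothing about whether the mdadm monitor was actually running or whether the local mailer could deliver.

```python
def find_mailaddr(conf_text):
    """Return the addresses found on MAILADDR lines of an mdadm.conf text.

    Simplified: comment lines are skipped, the keyword is matched
    case-insensitively, continuation lines are not handled.
    """
    addresses = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        parts = stripped.split()
        if parts[0].lower() == "mailaddr":
            addresses.extend(parts[1:])
    return addresses


# Hypothetical sample, NOT my real configuration file.
SAMPLE_CONF = """\
# mdadm.conf
DEVICE partitions
MAILADDR root@example.com
"""

found = find_mailaddr(SAMPLE_CONF)
if found:
    print("mdadm alerts would be addressed to:", ", ".join(found))
else:
    print("no MAILADDR line found")
```

To run it against the real file, read /etc/mdadm/mdadm.conf and pass its contents to `find_mailaddr` instead of `SAMPLE_CONF`.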
This server was also configured to forward all syslog messages to a log host, so I checked my log host. The relevant parts are:
May 20 15:38:40 kernel: [ 1.869825] md0: detected capacity change from 0 to 536858624
May 20 15:38:40 kernel: [ 1.870687] md0: unknown partition table
May 20 15:38:40 kernel: [ 1.877412] md: bind
May 20 15:38:40 kernel: [ 1.878337] md/raid1:md1: not clean -- starting background reconstruction
May 20 15:38:40 kernel: [ 1.878376] md/raid1:md1: active with 2 out of 2 mirrors
May 20 15:38:40 kernel: [ 1.878418] md1: detected capacity change from 0 to 3000052808704
May 20 15:38:40 kernel: [ 1.878575] md: resync of RAID array md1
[snip]
May 20 15:52:33 kernel: Kernel logging (proc) stopped.
May 20 15:52:33 rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="845" x-info="http://www.rsyslog.com"] exiting on signal 15.
As you can see, the system (the normal one, not the rescue system) had already detected during boot that the RAID array md1 was not clean and had started a resync. About 14 minutes later, something (not me) halted the system: rsyslogd exited on signal 15, which suggests an orderly shutdown rather than a crash.
So my questions are:
- What could cause the disks to suddenly become out of sync?
- Why was I not notified by email?
- Why was the error not properly logged to syslog before halting the system? Could it be that the system tried to log to syslog, but did so after stopping the syslog daemon? If so what can I do to prevent that?
- What can I do to find out what happened? Or, if there's no way for me now to find out what happened, how can I improve logging and notifications so that next time I can do a better post-mortem?
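To illustrate the kind of extra monitoring I have in mind for the last point, here is a minimal sketch that parses the text of /proc/mdstat and reports arrays that are degraded or resyncing. The function name `parse_mdstat` and the sample text are hypothetical (the sample is illustrative, not captured from my server), and the regular expressions assume the usual `[n/m] [UU]` status layout. The result could be logged or mailed from cron independently of mdadm's own monitor.

```python
import re

# Patterns assume the common /proc/mdstat layout.
ARRAY_RE = re.compile(r"^(md\d+) : (\S+)")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")
SYNC_RE = re.compile(r"\b(resync|recovery|check|reshape)\s*=")


def parse_mdstat(text):
    """Map each md array name to its state, degraded flag and sync action."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            arrays[current] = {
                "state": header.group(2),
                "degraded": False,
                "syncing": None,
            }
            continue
        if current is None:
            continue
        if not line.strip():
            current = None  # a blank line ends the array's block
            continue
        status = STATUS_RE.search(line)
        if status:
            # An underscore marks a missing or failed member.
            arrays[current]["degraded"] = "_" in status.group(3)
        sync = SYNC_RE.search(line)
        if sync:
            arrays[current]["syncing"] = sync.group(1)
    return arrays


# Illustrative sample only; device names and sizes are made up.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      1000000 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.5% (5000/1000000) finish=10.0min speed=1000K/sec

md0 : active raid1 sda1[0]
      500000 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

for name, info in sorted(parse_mdstat(SAMPLE_MDSTAT).items()):
    if info["degraded"] or info["syncing"]:
        print(name, "needs attention:", info)
```

On a real system the input would come from reading /proc/mdstat rather than from `SAMPLE_MDSTAT`.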
My question is not about proper backup practice. I already know that RAID is not a backup etc. My question is solely about notifications and diagnosis.