Long version: I am running a Red Hat Enterprise Linux 5 (RHEL5) machine with software RAID1 (mdadm).
A few days ago I went to back up some MySQL data and all of a sudden I could no longer log into the machine. I would type a username at the login prompt and it would just sit there. If I pressed control sequences they would appear on the screen, but it would never log in. It also did not respond to Ctrl+Alt+Delete, so I did a hard power down.
I booted it back up and monitored the RAID1 array via:
mdadm --detail /dev/md1
This array holds the root mount point.
It began to do a resync of the array. I am not sure if that was triggered by whatever caused the lockup or just by the hard power down. Either way, I let it finish:
[f@mysqldatanode ~]# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Thu Apr 19 15:28:52 2007
     Raid Level : raid1
     Array Size : 479893568 (457.66 GiB 491.41 GB)
    Device Size : 479893568 (457.66 GiB 491.41 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Fri Dec 25 10:03:50 2009
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : ab4849de:1f4f41c4:defd01e8:a4979ca6
         Events : 0.78

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
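If a progress log would help with the diagnosis, I can also watch the next resync through the standard /proc/mdstat interface instead of re-running mdadm --detail; a rough sketch of what I mean (the 60-second refresh interval is an arbitrary choice):

# /proc/mdstat reports each md array's state, and during a resync it also shows progress and an ETA
watch -n 60 cat /proc/mdstat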
I looked through some logs (/var/log/messages*) and found several messages like the ones below indicating hard drive trouble on /dev/sdb:
Dec 21 11:39:47 localhost kernel: sd 0:0:1:0: SCSI error: return code = 0x08000002
Dec 21 11:39:47 localhost kernel: sdb: Current: sense key: Medium Error
Dec 21 11:39:47 localhost kernel: Additional sense: Unrecovered read error
Dec 21 11:39:47 localhost kernel: Info fld=0x3348912
Dec 21 11:39:47 localhost kernel: end_request: I/O error, dev sdb, sector 53774610
Dec 21 11:39:47 localhost kernel: raid1:md1: read error corrected (8 sectors at 53565760 on sdb2)
Dec 21 11:39:48 localhost kernel: raid1: sdb2: redirecting sector 53565648 to another mirror
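(If anyone wants to reproduce the search: these are the sort of greps that turn those lines up in the current and rotated logs; the patterns are just strings taken from the messages above.)

# Pull the medium-error and I/O-error lines out of /var/log/messages and its rotated copies
grep -i "medium error" /var/log/messages*
grep "end_request: I/O error" /var/log/messages*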
So then I tried to check for bad blocks, and the machine locked up again in the same fashion:
[f@mysqldatanode ~]# badblocks -s /dev/md1
Checking for bad blocks (read-only test): 0/ 479893568
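What I'm considering trying next, once I know whether it's safe, is checking the individual member partitions rather than the md device, and pulling the drives' SMART data; a rough sketch of both (this assumes smartmontools is installed, which I have not confirmed on this box):

# Read-only badblocks scan of each RAID1 member; -s shows progress, -v prints each bad block found
badblocks -sv /dev/sda2
badblocks -sv /dev/sdb2

# SMART overall health verdict, full attribute/error dump, and a long offline self-test
smartctl -H /dev/sdb
smartctl -a /dev/sdb
smartctl -t long /dev/sdb
# results of the self-test show up later under:
smartctl -l selftest /dev/sdb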
So how should I go about evaluating the health of the two drives? Since the array in question holds the root mount point, do I need to move the drives to another machine to analyze them?