Long version: I am running a Red Hat Enterprise Linux 5 (RHEL 5) machine with a software RAID 1 array managed by mdadm.

A few days ago I went to back up some MySQL data and all of a sudden I could no longer log into the machine. I typed in a username to log in and then it would just sit there. If I pressed control sequences they would appear on the screen, but it would never log in. It also did not respond to Ctrl+Alt+Delete, so I did a hard power-down.

I booted it back up and monitored the raid1 array via:

mdadm --detail /dev/md1

This array holds the root mount point.
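
(As a side note, I believe the same state plus the resync progress can also be followed through /proc/mdstat; these are just the commands I understand to be the usual way to watch it:)

cat /proc/mdstat
# or refresh it every few seconds while a resync is running
watch -n 5 cat /proc/mdstat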

It began to do a resync of the array. I am not sure if this happened because of the crash or just because I did a hard power down. Either way I let it finish:

[f@mysqldatanode ~]# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Thu Apr 19 15:28:52 2007
     Raid Level : raid1
     Array Size : 479893568 (457.66 GiB 491.41 GB)
    Device Size : 479893568 (457.66 GiB 491.41 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Fri Dec 25 10:03:50 2009
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : ab4849de:1f4f41c4:defd01e8:a4979ca6
         Events : 0.78

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2

I looked through some logs (/var/log/messages*) and found several messages like the one below indicating hard-drive trouble:

Dec 21 11:39:47 localhost kernel: sd 0:0:1:0: SCSI error: return code = 0x08000002
Dec 21 11:39:47 localhost kernel: sdb: Current: sense key: Medium Error
Dec 21 11:39:47 localhost kernel:     Additional sense: Unrecovered read error
Dec 21 11:39:47 localhost kernel: Info fld=0x3348912
Dec 21 11:39:47 localhost kernel: end_request: I/O error, dev sdb, sector 53774610
Dec 21 11:39:47 localhost kernel: raid1:md1: read error corrected (8 sectors at 53565760 on sdb2)
Dec 21 11:39:48 localhost kernel: raid1: sdb2: redirecting sector 53565648 to another mirror

So then I tried to look for badblocks and it locked up again in the same fashion.

[f@mysqldatanode ~]# badblocks -s /dev/md1
Checking for bad blocks (read-only test):               0/      479893568

So how should I go about evaluating the health of the two drives? Since the array in question holds the root mount point, do I need to move them to another machine to analyze them?
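
My best guess is that I could at least run a read-only badblocks pass against each member partition separately rather than against the whole md device, but I am not sure whether that is the right approach (the commands below are only my guess):

# read-only scan of each half of the mirror; -s shows progress, -v is verbose
badblocks -sv /dev/sda2
badblocks -sv /dev/sdb2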

fredrick

2 Answers


You can fail the /dev/sdb device through mdadm (best make sure you fail the entire device, i.e. its partition in every md array that runs off it) and then check it for errors, but from what you are describing you are most likely better off just replacing the device.
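
A rough sketch of what that looks like; only /dev/md1 with /dev/sdb2 is shown in your output, so the second array below is purely hypothetical:

# mark the suspect partition failed in md1 and pull it out
mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2

# repeat for any other array that uses a partition on sdb, e.g. (hypothetical):
# mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

# with the raid no longer touching the disk, the whole device can be tested
badblocks -sv /dev/sdb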

I have had IDE devices that failed on a regular basis; I kept re-adding the rejected device until the computer finally started hanging like you describe. Replacing the failing device solved the problem.

In either case you should make a backup as soon as possible.
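
For example, something along these lines would get the MySQL data off while the array is still readable (the destination host and paths are only placeholders):

# dump all databases to a compressed file
mysqldump --all-databases --single-transaction | gzip > /root/alldb-$(date +%Y%m%d).sql.gz

# copy it to another machine (hypothetical host/path)
scp /root/alldb-*.sql.gz backuphost:/srv/backup/mysqldatanode/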

Rune Nilssen
  • +1 for "just replace the drive". Disks cost about 5 minutes' worth of sysadmin time these days. – womble Dec 28 '09 at 23:37
  • Oh ok. Would you back up before you fail the bad drive? – fredrick Dec 29 '09 at 00:37
  • I would back up immediately :-) I would not try to fail the drive until I get as much data off the RAID as possible, simply because you cannot really know for sure whether it is the one drive or the other. If you are having problems copying data off the md, I'd suggest you copy data off the drive in chunks to get as much backed up as possible, and as long as you have the opportunity, buy into a commercial backup service so that you store fresh offsite backups each night. :-) – Rune Nilssen Dec 29 '09 at 00:40
  • Well I failed the entire device, replaced the hard drive, and then followed this post for recovery: http://serverfault.com/questions/97565/raid1-how-do-i-fail-a-drive-thats-marked-as-removed Personally I skipped steps #2 and #3. I made it through the whole badblocks check this time, so it looks like it's working again. Thanks! – fredrick Dec 30 '09 at 16:49
  • One other thing: I skipped step 3 because I went into the grub console and typed find /grub/stage1. It found it on the surviving drive. After reconstruction I ran the same find command and it found it on both drives, so I'm assuming I will be able to boot off both drives (I guess I should probably test this). – fredrick Dec 30 '09 at 16:57
  • Sounds like you made it through. You should also consider converting to pure hardware RAID in the future; it's so insanely convenient just to hot-swap the drive and forget about it :) – Rune Nilssen Dec 31 '09 at 18:32

Read errors are common, but the disks correct most of them by themselves. Some disks lie and only report good reads in the SMART info, while others report the true number of read errors and the number recovered by ECC. Some disks (perpendicular-recording drives in particular) can log millions of read errors with 99.99999% (or more) of them recovered by ECC.
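
A quick way to see what the drive itself claims is smartmontools; attribute names and thresholds vary by vendor, so treat this as a sketch rather than a definitive health check:

# overall health verdict, SMART attributes and internal error counters
smartctl -a /dev/sdb

# the drive's own error log, then kick off a long self-test
smartctl -l error /dev/sdb
smartctl -t long /dev/sdb

# once the self-test has had time to finish, read back the results
smartctl -l selftest /dev/sdb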

However, this time /dev/sdb2 failed to correctly read 8 sectors.

The softraid then simply recovered by fetching the missing sectors from the other disk and rewriting them, and then decided that everything was fine again.
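
If you want to flush out any other latent unreadable sectors instead of waiting for normal reads to stumble over them, you can ask md to scrub the whole array (this uses the standard md sysfs interface; I am assuming a reasonably recent 2.6 kernel such as the RHEL 5 one):

# start a full read/compare pass over every sector of the array
echo check > /sys/block/md1/md/sync_action

# progress is reported in /proc/mdstat while it runs
cat /proc/mdstat

# afterwards, the number of mirror mismatches that were found
cat /sys/block/md1/md/mismatch_cnt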

This COULD be a sign of a bad drive, but it could also be a once-in-an-MTBF error, a stray dust particle or whatever. Wait and see if more errors pop up before you scrap this drive.