
I have a historically flaky MSA1500 with 2xMSA20 enclosures attached. Recently a disk failed in one of the enclosures.

The LCD display reports that interim state recovery for all volumes was successful.

On hot-swapping the failed drive, one of the volumes failed to rebuild with error:

112 VOLUME #0 REBUILD FAILURE

The other 5 volumes have successfully rebuilt.

According to the manual here:

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c00282726

"When the volume is still operating in regenerative mode, remove the new SCSI drive that was added as a replacement for the original failed drive and replace it with a different new drive."

How can I ascertain that the system is in regenerative mode before replacing the replacement disk? Is the fact that the interim state recovery is still in effect an indication of this?
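
If it helps, one option I can think of is to poll the logical drive status from an attached host with the Array Configuration Utility CLI (hpacucli). A rough sketch of what I have in mind (assuming hpacucli is installed on the attached host and can see the MSA controller in-band; the status strings are the generic ACU ones and I have not verified them against this firmware):

    #!/usr/bin/env python3
    """Rough sketch: list logical drive states via the HP Array Configuration
    Utility CLI (hpacucli). Assumes hpacucli is installed on an attached host
    and that the MSA controller is visible to it in-band."""
    import subprocess

    def logical_drive_lines():
        # 'ctrl all show config' prints controllers, arrays and logical drives
        # with a state such as OK, Interim Recovery Mode, Recovering or Failed.
        out = subprocess.run(
            ["hpacucli", "ctrl", "all", "show", "config"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [l.strip() for l in out.splitlines() if "logicaldrive" in l]

    if __name__ == "__main__":
        for line in logical_drive_lines():
            print(line)

A volume still showing "Interim Recovery Mode" (or "Recovering") would, I assume, confirm the regenerative state the manual is talking about.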

peterh
  • You don't mention what size disk arrays you're talking about, but we had a PERC RAID 5 fail with a bad drive. Stuck in a new drive, and it failed the rebuild with one of the disks that was showing as good, so it couldn't repair. Ran chkdsk (didn't matter) and ran the on-controller disk repair (kept "fixing" the issue repeatedly after several hours only to reboot and spew the same rebuild fail error); ended up replacing that disk too and having to restore from bare metal backup. Unrecoverable Disk Errors are becoming more commonplace with bigger drives. – Bart Silverstrim Nov 09 '09 at 12:46
  • There are 12x500GB drives in each enclosure, in an 8TB/4TB LVM/EXT3 arrangement. Unfortunately I did not get to configure the RAID, so there are no hot spares (thanks, vendor!); if we get a two-disk failure we're back to the tape backups, and these have never been tested in anger. I daren't bring the system up until the new drive is in, so on-controller/CLI repairs are all I am willing to attempt. –  Nov 09 '09 at 17:06

1 Answer


OK, so this was one of those 'edge case' issues, I feel.

On closer inspection with an HP engineer, it was determined that one of the other drives in the RAID was unrecognised. Interestingly, it is one that we have had replaced before.

As the LUNs are configured in ADG (HP's dual-parity RAID level), we can tolerate two drive failures. The rebuild failure was down to this other damaged drive.

The procedure from HP (as we have no hot spares configured - thanks to the vendors of the kit...) was to remove the newly identified faulty drive, let the rebuild complete, then insert a new drive into that drive's bay and let the rebuild complete again.
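
As we had to go through this twice with no hot spares, a rough way to make sure each rebuild has actually finished before the next drive swap is to poll the controller from an attached host. A minimal sketch, under the same assumptions as before (hpacucli installed on the attached host, controller visible in-band, generic ACU status strings):

    #!/usr/bin/env python3
    """Rough sketch: wait until no logical drive reports a rebuilding or
    interim state before doing the next drive swap. Assumes hpacucli is
    installed on an attached host and can see the controller in-band."""
    import subprocess
    import time

    BUSY_STATES = ("Recovering", "Interim Recovery Mode", "Ready for Rebuild")

    def busy_drives():
        out = subprocess.run(
            ["hpacucli", "ctrl", "all", "show", "config"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [l.strip() for l in out.splitlines()
                if "logicaldrive" in l and any(s in l for s in BUSY_STATES)]

    if __name__ == "__main__":
        while True:
            busy = busy_drives()
            if not busy:
                print("All logical drives report OK - safe to do the next swap.")
                break
            for line in busy:
                print(line)
            time.sleep(300)  # check again in five minutes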