Failing disks in RAID array - strategy suggestions required

Question

I have a Linux-based software RAID 5 array. SMART has just started sending me emails reporting that one of the 5 disks has a Current Pending Sector Count of 9 and an Offline Uncorrectable Count of 9. I have done a lot of googling, and the consensus seems to be that if I overwrite the sectors with zeros, the disk will remap them and all will be well.

I did want to track down which files were affected, but I have difficulty doing the mapping as I have 5 disks in RAID 5 with LUKS encryption on top, and finally LVM on top of that. None of the research I did helped me get through that tangle.

In the end, my plan was to simply fail the drive and re-add it to make the array re-build.

Before I did that, I ran 'long' tests on the other disks in the array. All were perfect apart from one, which had a Reallocated Sector count of 82,82,36,764.

So 2 out of 5 drives have issues.

At this point I am a little confused as to the best approach to trying to flush these errors out, if it is at all possible.

Does anyone have any advice? I am happy to replace failing drives where necessary, but would like to try to get the data straight first.

Sammitch is exactly right. Before you do anything, make sure your backups are up to date. — David Schwartz, Jan 18 '13 at 00:04

score 3 · Accepted Answer · answered Jan 17 '13 at 22:24

This will be the general process. See the mdraid man page and your own local configuration for the exact commands to use, if you don't already know them.

1. Pray.
2. Verify that your backup is current. Run it manually if necessary. If you don't have backups, make one now.
3. Fail the drive with pending sector and offline uncorrectable sectors. The other drive with reallocated sectors will live a little longer, and hopefully long enough to complete this process, but this drive is at the point where it could kill your entire array.
4. Replace the drive. In hardware. Partition the new drive and add it to the mdraid array.
5. Rebuild the array and wait for the rebuild to complete. In newer versions of mdraid, the rebuild will start automatically.
6. Repeat the process with the second drive.
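
For reference, the fail, remove and add steps above might look roughly like the sketch below. The array name /dev/md0 and the partitions /dev/sdX1 (failing) and /dev/sdY1 (replacement) are placeholders and not taken from the question; the run helper and the DRY_RUN variable are made up for this example so that nothing touches the array until you have checked the printed commands.

```shell
#!/bin/sh
# Sketch only. /dev/md0, /dev/sdX1 and /dev/sdY1 are hypothetical names.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Mark the suspect member as failed, then detach it from the array.
fail_and_remove() {
    run mdadm --manage "$1" --fail "$2" &&
    run mdadm --manage "$1" --remove "$2"
}

# After the physical swap and partitioning: add the new member,
# then show /proc/mdstat so the rebuild progress is visible.
add_replacement() {
    run mdadm --manage "$1" --add "$2" &&
    run cat /proc/mdstat
}

fail_and_remove /dev/md0 /dev/sdX1
add_replacement /dev/md0 /dev/sdY1
```

With the default DRY_RUN=1 this only prints the four commands. The hardware swap and partitioning happen between the two function calls, so in a real run you would invoke them separately rather than back to back.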

answered Jan 17 '13 at 22:24

Michael Hampton

I was hoping that the disk I was first worried about (with pending sector and offline uncorrectable sectors) might be fixable by re-adding to the array, and allowing the sectors to be re-mapped. I was actually more concerned with the second disk (Reallocated Sector count = x). Was this looking at the issues the wrong way round in terms of seriousness? – Tony Rogers Jan 17 '13 at 22:37
+1 Make a burnt offering to Charles Babbage – Bigbio2002 Jan 17 '13 at 22:38
The problem is that the drive with pending sectors will stall out during reading of one or more of them, perhaps causing the array to think it's failed prematurely. This could happen at any moment. The other disk doesn't seem to be having problems reallocating sectors, which is why it will likely survive a bit longer (where a bit could be minutes or months). – Michael Hampton Jan 17 '13 at 22:39
Ok That makes sense. I wasn't quite sure which was going to take things out first at a random and inconvenient time. – Tony Rogers Jan 17 '13 at 22:50
Just to let you know, I successfully copied the contents of the array onto Blu-ray discs and then replaced the failing disks one at a time, with no problems. Many thanks for your help. – Tony Rogers Jan 27 '13 at 17:37

score 0 · Answer 2 · edited Sep 29 '16 at 12:11

You can force a check and repair of the array with the following command (as root). Adjust it to your needs by substituting the name of your array:

echo repair > /sys/block/md0/md/sync_action

Of course, you really need to back up the data before you start, and you should consider replacing the damaged HDD with a new one.
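
As a gentler first pass, it may be worth writing check rather than repair and then looking at the mismatch counter. As far as I understand the md documentation, check reads every stripe and only counts mismatches, while repair also rewrites them; in both cases a sector that fails to read is rewritten from the remaining disks. The sketch below wraps the sysfs files in small functions. SYSFS_ROOT is a made-up override for this example so the functions can be tried against a scratch directory, and md0 is only an example array name.

```shell
#!/bin/sh
# Sketch only. md0 is an example array name.
# SYSFS_ROOT is a hypothetical override for trying this outside /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

md_dir() { echo "$SYSFS_ROOT/block/$1/md"; }

# Start a scrub. "check" counts mismatches; "repair" also rewrites them.
start_scrub() {
    action="${2:-check}"
    case "$action" in
        check|repair) echo "$action" > "$(md_dir "$1")/sync_action" ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
}

# Current sync state as reported by the kernel (idle, check, repair, ...).
scrub_state() { cat "$(md_dir "$1")/sync_action"; }

# Number of mismatches found by the last check or repair.
mismatch_count() { cat "$(md_dir "$1")/mismatch_cnt"; }
```

Usage as root would be along the lines of start_scrub md0, then scrub_state md0 until it reports idle again, then mismatch_count md0.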

You can copy the partition layout from one disk to another with a command like

sfdisk -d /dev/sda | sfdisk /dev/sdb

Of course, double-check the disk names before executing that. You don't want to erase the partitions on a good disk.
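
One way to make the copy a bit harder to get wrong is to dump the layout to a file first and refuse to run when source and target are the same device. This is only a sketch: /dev/sda and /dev/sdb are the example names from above, and the DRY_RUN variable and the dump file name are made up for this example.

```shell
#!/bin/sh
# Sketch only. Device names are examples.
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

copy_partition_table() {
    src="$1"; dst="$2"
    # Guard against the classic mistake of naming the same disk twice.
    if [ "$src" = "$dst" ]; then
        echo "source and target are the same device: $src" >&2
        return 1
    fi
    # Keep the dump in a file so the source layout can be inspected or reused.
    dump="$(basename "$src").sfdisk"
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ sfdisk -d $src > $dump"
        echo "+ sfdisk $dst < $dump"
    else
        sfdisk -d "$src" > "$dump" && sfdisk "$dst" < "$dump"
    fi
}

copy_partition_table /dev/sda /dev/sdb
```

Reading the dump file before the second step is a cheap way to confirm that the source really is the disk you think it is.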

Adding a partition or disk to an array is described in the mdadm manual. Good luck.

edited Sep 29 '16 at 12:11

pacey

answered Sep 29 '16 at 11:47

Tomasz Szczypel

Thank you pacey. I need more practice with English. I also need to better understand how to format text. – Tomasz Szczypel Sep 29 '16 at 12:20
