0

In a previous question I asked how to rebuild a faulty disk in a 4-disk RAID 5 array. I installed a new drive (drive 4) in place of the faulty one and started a rebuild. During the rebuild, another disk (drive 2) started throwing ECC errors and timeouts. At 95% of the rebuild, the computer rebooted and hung at the start screen, with the controller (3ware 9500s) showing an error (drive 2 not found) and the typical noise of a failing drive coming from drive 2.

I turned the PC off and on a few times, with no change. Then I left the PC off for an hour. When I turned it on again, the missing drive (drive 2) was back. I could boot the operating system, and the controller automatically restarted the rebuild. At a certain point, the controller reported a rebuild error and halted the rebuild.

The server is now running with drive 2 showing errors and drive 4 showing an OK status, but the array is degraded because the rebuild could not complete. It looks like I'm at a dead end: at least 3 drives need to be OK for the array to be healthy, but one drive has errors and one drive is not rebuilt. What can I try?

peterh
Riccardo

4 Answers

3

Your best bet is to restore from backups. But I'm guessing you don't have those, or you wouldn't be asking the question.

So, failing backups, your next best bet is to copy as much of the data off as possible (from the sounds of things you'll have at least a couple unreadable sectors that won't be copyable) with whatever method you favor - file copy, disk image, disk-level copy, etc. Then once you have your data, you can replace the faulty drives, create a new RAID array and copy your data back.

Failing that, you can go through the expensive process of professional data recovery, or just accept the data loss and move on, depending on how much your data is worth to you.

HopelessN00b
  • I do have backups, I just wanted to avoid re-installing everything, as the PC is rather old; will need a new PC, a new OS, a newer SQL server version....3/6 days downtime....... – Riccardo Mar 06 '13 at 20:52
1

The easiest thing would be to restore from backup. But you're probably asking this question because you don't have one. In that case you are going to call a disk drive recovery center and see what they can do for you.

When you finally get this rebuilt you'll learn the real value of a backup system that works.

toppledwagon
0

Can you show the output of tw_cli /c0 show all?

If drive 2 is in ECC-ERROR state, you can possibly continue the rebuild by telling the controller to ignore the ECC errors on drive 2.

@Sergey Vasilov's answer in the thread "What does 3Ware's tw_cli mean by a "DEGRADED" disk vs "ECC-ERROR"?" has the right information. (I used to know this offhand, but had to look up the commands, and Sergey's answer was the first hit in a Google search, so I'll give him the credit.) Because it's always better to actually quote the answer:

/cx/ux start rebuild disk=p [ignoreECC]
/cx/ux set ignoreECC=on|off

Even if this lets you rebuild the array, you may still have filesystem corruption or data loss. Or you may not.
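
As a rough illustration of how the quoted syntax fills in, here is a minimal Python sketch that only builds the command line and does not run it. The numbers used (controller 0, unit 0, port 3) are an assumption taken from the `tw_cli` output posted elsewhere in this thread, where p3 is the drive shown as DEGRADED; `rebuild_command` is a hypothetical helper name, not part of any 3ware tool.

```python
from typing import List


def rebuild_command(controller: int, unit: int, port: int,
                    ignore_ecc: bool = False) -> List[str]:
    """Build the argv for: tw_cli /cx/ux start rebuild disk=p [ignoreECC].

    Hypothetical helper: it only assembles the command, it never executes it.
    """
    argv = ["tw_cli", f"/c{controller}/u{unit}", "start", "rebuild",
            f"disk={port}"]
    if ignore_ecc:
        # Optional keyword from the quoted syntax; tells the controller to
        # carry on past ECC errors on the source drives.
        argv.append("ignoreECC")
    return argv


if __name__ == "__main__":
    # Assumed values: controller 0, unit 0, rebuild onto port 3.
    print(" ".join(rebuild_command(0, 0, 3, ignore_ecc=True)))
```

Printing the command before running it by hand leaves a chance to double-check the port number, since rebuilding onto the wrong disk would make things worse.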

Daniel Lawson
  • Assuming I could force the rebuild ignoring errors, would there be a way to know where the failure occurred? – Riccardo Mar 06 '13 at 20:59
  • I don't have any ideas for you finding out which blocks were affected, although I suspect there is a way. I'd definitely start by running a fsck on the filesystems though, as that is very likely to catch most of the errors that result. – Daniel Lawson Mar 06 '13 at 21:28
  • Thanks Daniel. How about the way you suspect? – Riccardo Mar 06 '13 at 21:50
  • Follow me on this: I have several backups with user data. The backups should be good, as no error occurred while saving the data. So I could perform the rebuild ignoring ECC errors and then restore the saved data. Doing so would ensure that the backed-up data is back in place. What could be left out is damaged system/program files. In that case, sooner or later problems will show up, and I could simply reinstall the damaged files – Riccardo Mar 06 '13 at 22:07
  • Like I say, I can't really help you find it out. I suspect there is a way to find out from the RAID controller which LBAs were affected, which in turn lets you find out which blocks on the filesystem, which in turn lets you see what files, if any, were affected. I suspect this way is possible because I've seen it done elsewhere in slightly different circumstances, but I really cannot give you any more information than this. – Daniel Lawson Mar 06 '13 at 22:42
  • The server currently holds (a) system & programs (b) user data (c) backups. I suspect the failure occurs somewhere in the backup zone. If a filesystem check passes for (a) & (b), I could force a rebuild ignoring errors, and once done wipe (c). This should be safe enough....what do you think? – Riccardo Mar 07 '13 at 08:54
  • Unfortunately, I don't think you can say with any certainty that the failure occurs in the backup zone, without first finding out the LBA of the failed block(s) and working up from there. Filesystems don't always lay files out in linear order from the start of the drive, so just because it occurred at 95% doesn't mean you don't have critical files there. That said, I think you should be safe enough. Although please tell me you aren't storing the backups for this system on the same RAID array that just failed... – Daniel Lawson Mar 07 '13 at 09:04
  • If the filesystem check (chkdsk) succeeds in the vital zones, shouldn't that be ok? As for the backup, it is only a redundant one. Offline backups exist too. – Riccardo Mar 07 '13 at 09:19
  • Just for testing purposes I'm copying data from the server to a workstation. The server hung on a few files, with 3dm2 showing timeout errors, but the copy continued. One of those files was readable; another (a Word document) has garbage in part of it. Shouldn't the copy halt on a failing file? – Riccardo Mar 07 '13 at 09:49
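
The chain described in the comments above (failed LBA, then filesystem block, then file) starts with simple arithmetic. The following is a minimal sketch of that first step only, under several assumptions: the LBA is already a logical sector address on the RAID unit rather than on a physical drive, sectors are 512 bytes, and the partition start and cluster size are known. All numbers in the example are hypothetical, and mapping the resulting cluster to a file name still needs the filesystem's own tooling.

```python
SECTOR_SIZE = 512  # bytes per sector; an assumption, typical for older drives


def lba_to_cluster(unit_lba, partition_start_lba, cluster_size_bytes):
    """Map a logical sector on the RAID unit to a filesystem cluster number.

    Returns None when the sector lies before the start of the partition,
    which means it cannot belong to any file on that filesystem.
    """
    if unit_lba < partition_start_lba:
        return None
    # Byte offset of the bad sector, measured from the start of the partition.
    offset_bytes = (unit_lba - partition_start_lba) * SECTOR_SIZE
    # Integer division gives the index of the cluster containing that byte.
    return offset_bytes // cluster_size_bytes


if __name__ == "__main__":
    # Hypothetical values: partition starts at sector 63, 4 KiB clusters.
    print(lba_to_cluster(1000063, 63, 4096))
```

With 4 KiB clusters, eight consecutive sectors share one cluster, so several reported LBAs can point at the same damaged file.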
0

@Daniel this is the output from tw_cli

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -       -       64K     698.461   ON     ON

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     V503YE9G
p1     ECC-ERROR        u0     233.76 GB   490234752     V503Y7VG
p2     OK               u0     233.76 GB   490234752     V503Y4GG
p3     DEGRADED         u0     465.76 GB   976773168     WD-WCAYUJ776908

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       255    18-Nov-2006
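
For anyone who wants to check output like this from a script, here is a minimal Python sketch that pulls the port rows out of the pasted text and lists the ports whose status is not OK. The column layout is assumed to match the output above; `port_status` and `unhealthy_ports` are hypothetical helper names, and other firmware versions may format the table differently.

```python
import re

# Port table copied from the output above.
SAMPLE = """\
Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     233.76 GB   490234752     V503YE9G
p1     ECC-ERROR        u0     233.76 GB   490234752     V503Y7VG
p2     OK               u0     233.76 GB   490234752     V503Y4GG
p3     DEGRADED         u0     465.76 GB   976773168     WD-WCAYUJ776908
"""

# A port row starts with pN, followed by the status word and the unit (uN or -).
PORT_ROW = re.compile(r"^(p\d+)\s+(\S+)\s+(u\d+|-)\s")


def port_status(text):
    """Return {port: status} for every port row found in the text."""
    result = {}
    for line in text.splitlines():
        match = PORT_ROW.match(line)
        if match:
            result[match.group(1)] = match.group(2)
    return result


def unhealthy_ports(text):
    """Return only the ports whose status is something other than OK."""
    return {p: s for p, s in port_status(text).items() if s != "OK"}


if __name__ == "__main__":
    print(unhealthy_ports(SAMPLE))
```

On the table above this reports p1 (ECC-ERROR) and p3 (DEGRADED), the two drives discussed in the question.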
Riccardo