
I have 4 physical drives in a single virtual drive on an LSI MegaRAID SAS controller. It seems at least one of the drives has bad sectors because:

  • I/O errors occur when attempting to back up some files
  • running badblocks reports some bad sectors

I'm hoping that resolving the issue will be as simple as swapping out the problematic disk(s) and rebuilding the RAID array. I thought the LSI MegaRAID WebBIOS would let me identify the problematic disk(s), but I can't find any option to check for bad sectors.

Below is a screenshot of the WebBIOS.

Could anyone offer any advice as to how the problematic disk(s) can be identified?

James

3 Answers


Smartmontools has extensions that allow it to poll a drive for SMART data through LSI (and other) RAID controllers. Normally this isn't something you can do, as the RAID abstraction obscures direct access to the individual drives.

Smartmontools might not be installed on your machine, but it is available in the main repositories of most distributions, and there is even a Windows version at: http://sourceforge.net/projects/smartmontools/files/

It can be used to poll a drive behind an LSI MegaRAID controller like so:

smartctl -a -d megaraid,N /dev/sdX

Here, `-a` means display all disk data, and `-d` specifies the device type (`megaraid` in your case), followed by N, the drive number on that controller. To access the drive in slot 0, you would use 0 here. To poll all four of your drives, run the command four times, replacing N with 0 through 3. sdX is the RAID abstraction itself, as the operating system normally sees it; yours is probably sda.
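As a sketch, you could loop over all four drives at once. The drive numbers 0-3 are an assumption; they can differ in practice, so adjust them to whatever your controller reports:

```shell
# Sketch: poll SMART data for all four member drives behind the controller.
# Drive numbers 0-3 and /dev/sda are assumptions -- substitute your own.
report=""
for N in 0 1 2 3; do
    report="${report}=== megaraid,$N ===
"
    # Guard so the loop degrades gracefully where smartctl is absent.
    if command -v smartctl >/dev/null 2>&1; then
        report="${report}$(smartctl -a -d "megaraid,$N" /dev/sda 2>&1)
"
    else
        report="${report}smartctl not installed
"
    fi
done
printf '%s' "$report"
```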

You will see a long output from each drive. What you're looking for is either a reported general SMART failure (which you might not find, as your controller isn't rejecting drives), or non-zero "offline uncorrectable sectors" or "pending sectors" counts. Any drive with more than 0 in either of those fields is bad. Show those fields no mercy: it takes a LOT of failed reads to increment either value by one.
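For ATA drives, the relevant attribute lines look like the made-up sample below (SAS drives instead report an error-counter table with lines like "Total uncorrected errors"). A quick way to flag non-zero raw values:

```shell
# Made-up sample of the smartctl attribute lines worth checking; the raw
# value is the last column.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8'

# Print any of these attributes whose raw value is non-zero.
bad=$(printf '%s\n' "$sample" |
    awk '$2 ~ /Reallocated|Pending|Uncorrectable/ && $NF + 0 > 0 { print $2 }')
printf '%s\n' "$bad"
```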

You can also perform a short or long test like so (same rules above apply):

smartctl -t [long|short] -d megaraid,N /dev/sdX
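Once a test finishes, its outcome lands in the self-test log, readable with `smartctl -l selftest -d megaraid,N /dev/sdX`. The log lines below are made up for illustration, showing how a failed entry stands out:

```shell
# Made-up self-test log lines: test #1 passed, test #2 hit a read failure
# at a specific LBA (the last column).
log='# 1  Extended offline    Completed without error       00%     18383         -
# 2  Short offline       Completed: read failure       90%     18361         402653184'

# Anything other than "Completed without error" deserves attention.
failures=$(printf '%s\n' "$log" | grep -v 'Completed without error')
printf '%s\n' "$failures"
```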

Spooler
  • You're right it is sda. Unfortunately when running the command I get `Smartctl open device: /dev/sda [megaraid_disk_01] failed: INQUIRY failed` – James Sep 30 '16 at 06:10
  • Silly question: are you running it as root? – Spooler Sep 30 '16 at 06:11
  • 1
    Yes running as root. I think I've got it working - the indexes are 2,3,4,6 rather than 0,1,2,3 as I'd assumed. I found this out by running `MegaCli -LdPdInfo -a0` - this shows index as the "Device ID: XXX" – James Sep 30 '16 at 06:21
  • Two of the disks have non-zero values for 'read' under 'Total uncorrected errors' and 'Non-medium error count'. Are these the values I should be looking at? One of them is 'DiskGroup: 0, Span: 0, Arm: 1' and the other 'DiskGroup: 0, Span: 1, Arm: 0'. Any advice what to do next? – James Sep 30 '16 at 06:25
  • Non-medium error count means anything other than write-read, or verify errors. They typically (if uncorrected) involve the drive "dropping out" of a controller for a time or resetting without signal to. Since you also have an uncorrected error count, you are running into multi-bit and many-bit errors, which are damning but very hard to track with anything other than SMART. You should replace those drives immediately. – Spooler Sep 30 '16 at 06:41
  • Am I right in thinking that both can be replaced since they are on separate spans? Should they be replaced one at a time - i.e. replace a drive, let it finish rebuilding and then replace the next drive? – James Sep 30 '16 at 06:50
  • 2
    Always replace one drive at a time, unless you're really good at (and enjoy) restoring broken arrays. This is true of pretty much any by-disk array membership. – Spooler Sep 30 '16 at 06:58

If the RAID passes the errors on to you, then obviously something is wrong that cannot be silently corrected.

If you get read errors, that means that all redundant copies of these blocks have been destroyed. The faulty drives are not ejected, because there are no spares.

If you get write errors, that means that one drive continues to report write errors, and the RAID cannot eject it because it is not currently redundant. You should never see a write error in a redundant setup, so if you do, replace the controller.

If you can add more disks, create a third mirror -- recovery will complain, and you will need to check the file systems, but you should end up with as much of your data intact as can be, and I'd expect any good controller to then kick out all broken disks.

Once you are back on a clean setup, set up scheduled checks to catch these errors before they become a problem.
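One way to schedule such checks with the same toolset is smartd. A minimal /etc/smartd.conf sketch, assuming the drive numbers 2 and 3 (placeholders; use whichever numbers worked with smartctl):

```
# /etc/smartd.conf sketch -- drive numbers are placeholders.
# -a: monitor all attributes; -s L/../../7/02: long self-test Sundays 02:00.
/dev/sda -d megaraid,2 -a -s L/../../7/02
/dev/sda -d megaraid,3 -a -s L/../../7/02
```

Most controllers can also run their own periodic consistency checks ("patrol read"), which is worth enabling alongside this.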

Simon Richter
  • Couldn't write errors also just be that - write attempts that were finally unsuccessful? – user121391 Sep 30 '16 at 13:15
  • 2
    @user121391, the drives are supposed to remap bad sectors on write, silently. If a drive reports a write error that means it has run out of sectors to remap to, so a large number of sectors has gone bad. That is usually reason to immediately kick the drive out. Propagating a write error upwards means that *none* of the drives could write to that sector. That is either the controller being broken and writing to an invalid sector (-> replace the controller), or all of your drives have severe problems and the entire setup needs to be investigated. – Simon Richter Sep 30 '16 at 14:33
  • @user121391, disk failures are either gradual and only detectable on access, or sudden and global. That is why you need to read and compare the data across all disks periodically -- any drive reporting a read failure is given a good copy by rewriting the sectors, that the drive should store in one of the remapped sectors, and the error is logged for the admin. If a drive fails to read the same sector again on the next check, throw it out and never buy from the same vendor again. – Simon Richter Sep 30 '16 at 14:40
  • I think I misunderstood your initial answer in the sense of "get write errors" as "get a report that write errors have occurred in a drive in the array" vs. "receive write error from the disk directly", that was the reason for my confusion. Now it makes sense. – user121391 Sep 30 '16 at 14:44
  • +1 for "If the RAID passes the errors on to you, then obviously something is wrong that cannot be silently corrected.". If the RAID reports errors while being used normally, it's already too late. (Of course, a special RAID utility reporting an error is something different). – Guntram Blohm Sep 30 '16 at 15:01

If you are using Linux or Windows, boot your system and use the megacli utility.

megacli -pdlist -aALL

In the results, check the "Firmware state" line. A degraded disk will show as:

Firmware state: Offline
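Since a disk can also be unhealthy without having been ejected, it's worth scanning the per-disk error counters in the same output. A sketch over a made-up pdlist fragment (real output has many more lines per slot):

```shell
# Made-up `megacli -pdlist` fragment; slot 1 carries media errors but is
# still reported as Online.
pdlist='Slot Number: 0
Media Error Count: 0
Firmware state: Online, Spun Up
Slot Number: 1
Media Error Count: 142
Firmware state: Online, Spun Up'

# Flag slots that are not Online, or that carry a non-zero media error count.
flagged=$(printf '%s\n' "$pdlist" | awk -F': ' '
    /^Slot Number/       { slot = $2 }
    /^Media Error Count/ { if ($2 + 0 > 0) print "slot " slot ": media errors " $2 }
    /^Firmware state/    { if ($2 !~ /Online/) print "slot " slot ": " $2 }')
printf '%s\n' "$flagged"
```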
Vikelidis Kostas
  • AFAIK megacli also exists for Windows. – HBruijn Sep 30 '16 at 05:49
  • @HBruijn I wasn't aware of that. Thanks for mentioning it. – Vikelidis Kostas Sep 30 '16 at 05:51
  • 1
    IIRC both versions even support the same *"intuitive"* command line arguments – HBruijn Sep 30 '16 at 05:55
  • 3
    While a disk may not be degraded in an array, it can still be bad. It takes quite a bit for controllers to eject a drive in some cases, and if they are not "manufacturer certified" drives, they won't get automatically ejected unless they have a total SMART failure. In the meantime, they will still negatively impact the array. – Spooler Sep 30 '16 at 05:59
  • 1
    Just to confirm, all firmware states are 'Online, Spun Up' – James Sep 30 '16 at 06:41