
I'm working on a remote server (Dell PowerEdge) with a fresh install. It has four mechanical drives (2 TB each) and two SSDs (250 GB each). One SSD contains the OS (RHEL 7), and the four mechanical disks will eventually hold an Oracle database.

Trying to create a software RAID array led to the disks constantly being marked as faulty. Checking dmesg shows a slew of errors like the following:

[127491.711407] blk_update_request: I/O error, dev sde, sector 3907026080
[127491.719699] sd 0:0:4:0: [sde] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[127491.719717] sd 0:0:4:0: [sde] Sense Key : Aborted Command [current]
[127491.719726] sd 0:0:4:0: [sde] Add. Sense: Logical block guard check failed
[127491.719734] sd 0:0:4:0: [sde] CDB: Read(32)
[127491.719742] sd 0:0:4:0: [sde] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
[127491.719750] sd 0:0:4:0: [sde] CDB[10]: e8 e0 7c a0 e8 e0 7c a0 00 00 00 00 00 00 00 08
[127491.719757] blk_update_request: I/O error, dev sde, sector 3907026080
[127491.719764] Buffer I/O error on dev sde, logical block 488378260, async page read
[127497.440222] sd 0:0:5:0: [sdf] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[127497.440240] sd 0:0:5:0: [sdf] Sense Key : Aborted Command [current]
[127497.440249] sd 0:0:5:0: [sdf] Add. Sense: Logical block guard check failed
[127497.440258] sd 0:0:5:0: [sdf] CDB: Read(32)
[127497.440266] sd 0:0:5:0: [sdf] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
[127497.440273] sd 0:0:5:0: [sdf] CDB[10]: 00 01 a0 00 00 01 a0 00 00 00 00 00 00 00 00 08
[127497.440280] blk_update_request: I/O error, dev sdf, sector 106496
[127497.901432] sd 0:0:5:0: [sdf] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[127497.901449] sd 0:0:5:0: [sdf] Sense Key : Aborted Command [current]
[127497.901458] sd 0:0:5:0: [sdf] Add. Sense: Logical block guard check failed
[127497.901467] sd 0:0:5:0: [sdf] CDB: Read(32)
[127497.901475] sd 0:0:5:0: [sdf] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
[127497.901482] sd 0:0:5:0: [sdf] CDB[10]: e8 e0 7c a0 e8 e0 7c a0 00 00 00 00 00 00 00 08
[127497.901489] blk_update_request: I/O error, dev sdf, sector 3907026080
[127497.911003] sd 0:0:5:0: [sdf] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[127497.911019] sd 0:0:5:0: [sdf] Sense Key : Aborted Command [current]
[127497.911029] sd 0:0:5:0: [sdf] Add. Sense: Logical block guard check failed
[127497.911037] sd 0:0:5:0: [sdf] CDB: Read(32)
[127497.911045] sd 0:0:5:0: [sdf] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
[127497.911052] sd 0:0:5:0: [sdf] CDB[10]: e8 e0 7c a0 e8 e0 7c a0 00 00 00 00 00 00 00 08
[127497.911059] blk_update_request: I/O error, dev sdf, sector 3907026080
[127497.911067] Buffer I/O error on dev sdf, logical block 488378260, async page read

These errors occur on all four mechanical disks (sdc/sdd/sde/sdf). smartctl passed all four disks on both the long and short self-tests. I'm currently running badblocks (write-mode test, ~35 hours in, probably another 35 to go).
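
For reference, the checks so far were roughly along these lines (device name and exact flags are illustrative, from memory):

$ smartctl -t short /dev/sdc   # short self-test; repeated with -t long for the long one
$ smartctl -a /dev/sdc         # overall health plus the attribute/error tables
$ badblocks -wsv /dev/sdc      # destructive write-mode test (very slow on a 2 TB disk)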

The following are the causes I've suspected/considered based on my research:

  • Failed HDDs - Seems unlikely that four "refurbished" disks would all be DOA, doesn't it?

  • Storage controller issue (bad cable?) - Seems like it would affect the SSDs too?

  • Kernel issue - The only change to the stock kernel was the addition of kmod-oracleasm. I really don't see how it would cause these faults; ASM isn't set up at all (a quick check is sketched below).
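
Regarding the kernel bullet, a quick way to rule the module out (assuming the module kmod-oracleasm provides is named oracleasm) would be:

$ lsmod | grep oracleasm       # check whether the module is even loaded
$ modprobe -r oracleasm        # if it is, unload it temporarily and re-test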

Another noteworthy event: when trying to zero the disks (part of early troubleshooting) with $ dd if=/dev/zero of=/dev/sdX, I got these errors:

dd: writing to ‘/dev/sdc’: Input/output error
106497+0 records in
106496+0 records out
54525952 bytes (55 MB) copied, 1.70583 s, 32.0 MB/s
dd: writing to ‘/dev/sdd’: Input/output error
106497+0 records in
106496+0 records out
54525952 bytes (55 MB) copied, 1.70417 s, 32.0 MB/s
dd: writing to ‘/dev/sde’: Input/output error
106497+0 records in
106496+0 records out
54525952 bytes (55 MB) copied, 1.71813 s, 31.7 MB/s
dd: writing to ‘/dev/sdf’: Input/output error
106497+0 records in
106496+0 records out
54525952 bytes (55 MB) copied, 1.71157 s, 31.9 MB/s

If anyone here could share some insight into what might be causing this, I'd be grateful. I'm inclined to follow Occam's razor and go straight for the HDDs; my only doubt stems from the unlikelihood of four HDDs failing out of the box.

I will be driving to the site tomorrow for a physical inspection and to report my assessment of this machine to the higher-ups. If there's something I should physically inspect (beyond cables/connections/power supply), please let me know.

Thanks.

Scu11y
  • When you say SMART "ok", do you just mean the overall health? Are any individual raw counters for reallocated or pending sectors non-zero? Drives don't immediately declare themselves failed on the first bad sector, even though it is unreadable. Use `smartctl -x /dev/sda` or something. But it's highly suspicious that it's the *same* LBA on all disks. – Peter Cordes Jun 18 '19 at 06:42

1 Answer


Your dd tests show all four disks failing at the same LBA. As it is extremely improbable that four disks would all fail at exactly the same location, I strongly suspect controller or cabling issues.
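
A quick sanity check of the numbers supports this (a sketch; dd's default 512-byte block size is assumed):

$ echo $((106496 * 512))   # bytes written before the first I/O error
54525952

That is, the 106496 records dd wrote successfully correspond exactly to the 54525952 bytes it reports copying, and 106496 is the very sector in "blk_update_request: I/O error, dev sdf, sector 106496" above. Four independent disks aborting at the same LBA points to a shared component (controller, cable, backplane) rather than the disks themselves.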

shodanshok
  • Okay, thanks. This was actually one of the things that made me suspect a controller fault. Wouldn't that affect the SSDs too? – Scu11y Jun 17 '19 at 12:27
  • It's difficult to tell without further testing. Anyway, the first thing I would check/replace is the cables attaching the controller to the backplane. – shodanshok Jun 17 '19 at 12:33
  • Sounds good. Would there be any meaningful data yielded from taking a multimeter to the cables and testing continuity? I'd like to try to give the higher-ups a definitive answer/fix tomorrow. – Scu11y Jun 17 '19 at 12:37
  • Also, I'm going to accept your answer; it makes sense and I appreciate your time. If anyone else has anything they think is worth checking (either hardware or software), I'd still love to hear it. Thanks again to this community for being a reliable source of knowledgeable second opinions. :) – Scu11y Jun 17 '19 at 12:40
  • High data-rate cables, such as 6/12 Gb/s SATA/SAS ones, are not only about electrical continuity, but mainly about signal clarity and low noise. Try to physically clean the connectors and reseat the cables. If the error persists, try changing them and, finally, try a different controller. – shodanshok Jun 17 '19 at 13:21
  • Same-LBA seems unlikely to be a cabling issue, unless the data in that sector just happens to be some worst-case bit sequence for the scrambling (which prevents extended runs of all-zeros from defeating self-clocking) or ECC over the SATA/SAS link. I'm not sure what encoding that link uses. A controller fault is plausible, though; the same LBA on each of multiple disks needs some kind of common-factor explanation. – Peter Cordes Jun 18 '19 at 06:40
  • I'm wondering, with a RAM-backed PERC, whether it's the RAM that's gone bad? – djsmiley2kStaysInside Jun 18 '19 at 11:20
  • @djsmiley2k It is unlikely that all four `dd`s ended up cached at the same failing RAM address. Moreover, the PERC's DRAM is ECC-protected and, while ECC RAM also fails, it is relatively uncommon. That said, the controller *can* be the source of the issue, so if changing the cables does not help, the OP should try swapping the controller. – shodanshok Jun 18 '19 at 12:55
  • I'm just remembering, long ago, a PERC card I encountered whose battery had failed and which somehow reported that all writes were succeeding (as in, they were written to the RAM and the controller reported 'all is fine here') when in fact it hadn't written anything to the underlying disks for around 6 months. I was speculating (and that's all, there's no research here) whether I'd have seen the same thing had I tried to read from any disk... as in, no matter the read, it fails with the same address being inaccessible? – djsmiley2kStaysInside Jun 18 '19 at 16:45
  • I had a long call with Dell today; they're planning to swap the controller and cables. I hope that resolves it. They told me the PERC wouldn't do anything because it's "just a passthrough to the mainboard". Also, we had some bad ECC memory when I first set the machine up and Dell sent us replacement memory. Hopefully this will be the last service call, but somehow I doubt it. Thanks to everyone here for your valuable insight; I appreciate all of you. – Scu11y Jun 18 '19 at 20:26
  • @Scu11y please report back if the controller/cables swap solves your problem. – shodanshok Jun 18 '19 at 20:57
  • Well my friends, you were right. Cables + controller swapped, and now 600 GB into a dd zeroing process with no errors thus far. Looks like everything's working correctly now. Thanks again for all the knowledge you've shared. I'm always grateful to this community for your expertise and willingness to share it. :) – Scu11y Jun 19 '19 at 21:27