
I manage an HP ProLiant DL380 G6 server for a student association; it was going to be thrown away by our university. The server has a P410i hardware RAID controller, which we use for a 3-drive RAID 5 holding the OS and a 4-drive RAID 10 holding our Owncloud data folder.

Everything ran smoothly for the most part until recently, when we started getting a lot of disk errors and the logical drives began dropping into read-only mode until repaired with fsck. dmesg shows a lot of I/O errors and messages about the logical drives being reset, with only about a second between the "resetting" and "reset successfully" messages:

DMESG log
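(The full log is long; roughly this kind of filtering pulls out the relevant lines, assuming the P410i is driven by the hpsa module as on reasonably recent kernels, or by cciss on older ones.)

```
# Kernel log with human-readable timestamps, keeping only the Smart Array
# driver messages and the block-layer I/O errors.
dmesg -T | grep -iE 'hpsa|cciss|i/o error|reset'
```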

Aside from a cache battery failure, the Smart Array seems to be working fine: the physical and logical drives show no errors and report status OK in hpssacli. The firmware is fairly outdated though, version 1.62-0. I have tried upgrading to the latest firmware, but I ran into the same issue as in the question How can I update the SmartArray P410i firmware on a DL360G6? The usual method via SPP Auto-Update fails, and I'd only like to use the proposed workaround as a last resort, since it could brick our RAID controller.
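For completeness, these are roughly the status checks that come back OK (a sketch; the embedded P410i usually shows up as slot 0, so adjust the slot number to whatever `hpssacli ctrl all show` lists):

```
# Controller, cache and battery status
hpssacli ctrl all show status

# Logical drives (the RAID 5 and RAID 10 arrays)
hpssacli ctrl slot=0 ld all show status

# Individual physical drives behind the controller
hpssacli ctrl slot=0 pd all show status
```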

I'm not sure whether our drives are failing or whether the RAID controller (or a bug in its firmware) is causing the issues. Could anyone provide some insight?

EDIT: the boot drive is in read-only mode again, and fsck is now giving segmentation faults.

1 Answer


sdb is dying, as it does not respond to host commands in a timely manner. However, from what I can tell, sdb is really an array (a logical volume/disk), so it does not represent any single physical disk.
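A hedged way to confirm which logical drive sits behind `/dev/sdb` (assuming the embedded controller is in slot 0; the "Disk Name" field in the detailed output maps each logical drive to its Linux block device):

```
# List every logical drive on the controller with its OS device node,
# RAID level and status.
hpssacli ctrl slot=0 ld all show detail | grep -iE 'logical drive|disk name|fault tolerance|status'
```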

The most probable causes are:

  • one (or more) of the physical disks is dying, perhaps due to a storm of reallocated sectors. Do your physical disks support TLER? Are they enterprise-grade disks? (See the sketch after this list for pulling per-disk SMART data through the controller.)

  • the controller itself has some problem. This can, for example, be related to its age or operating temperature.
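A sketch of how to query each physical drive through the controller, assuming smartmontools with the cciss passthrough and that the array is exposed as /dev/sda (the device node and disk indexes may differ on your setup; SAS disks report grown defects rather than ATA reallocated-sector counts):

```
# Query each physical drive behind the Smart Array controller; cciss,N
# selects the N-th physical disk.
for i in 0 1 2 3 4 5 6; do
    echo "=== physical disk $i ==="
    smartctl -a -d cciss,$i /dev/sda | grep -iE 'health|reallocated|pending|defect|hours'
done
```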

shodanshok
  • You're right, everything is back to normal after unmounting sdb, and there have been no more errors in dmesg for at least 8 hours now. I'll try running some SMART checks. Is it normal that sdb being mounted upsets the RAID controller to the point that it corrupts the boot disks as well, or is that an issue with the RAID controller? – Kaascroissant Oct 01 '19 at 20:03
  • @Kaascroissant I don't think `sdb` "corrupts" the boot disk. Rather, the controller is probably stuck trying to read from the component disks of the `sdb` array. If `sda` (another array on the same controller?) is OK, I would look at the disks used for the `sdb` array, replacing any failing one. – shodanshok Oct 01 '19 at 21:44
  • Yeah, sda is an array of 3 SSDs running the OS, and sdb is 4 HDDs for our Owncloud data folder. sda has no issues while sdb is unmounted, but when sdb is mounted, sda will also often be placed in read-only mode; most recently it gave segfaults when trying to run fsck and could not even run the reboot command until I physically turned the server off and on again, after which it was fixed by the forcefsck flag. The sdb disks return SMART status OK, but I can't get a full report because of "ATA output registers missing"; self-tests complete without error. I found out they have all been running 44,000+ hours though. – Kaascroissant Oct 01 '19 at 22:12