
My Debian (Jessie) based system sets one of my RAID disks to faulty after some days of running. If I reboot the machine, all is fine again for some days until the problem appears again.

Here's my environment:

The system is running Debian Jessie 64-bit and has two physical disks which are used as a RAID 1 array with mdadm.

The system also uses LVM for more flexible handling of partitions.

Two virtual machines are running inside a VirtualBox 5.1.10 environment. The .VDI files of these machines are located on the LVM volumes mentioned above.

Now I have the problem that after a few days one of the disks seems to have errors - at least the RAID controller sets the disk to faulty. In the last two months both physical disks have been replaced by new ones, but the problem is still there. For this reason I wonder whether those were real disk failures or whether the software RAID controller sets the disks to faulty even though they are fine.

Are there any known bugs for this combination of software RAID, LVM and VirtualBox?

Some command output:

~# cat /proc/mdstat

Personalities : [raid1]                                                                                                                                                             
md3 : active raid1 sda3[0] sdb3[2](F)                                                                                                                                               
      1458846016 blocks [2/1] [U_]                                                                                                                                                  

md1 : active raid1 sda1[0] sdb1[2](F)                                                                                                                                               
      4194240 blocks [2/1] [U_]                                                                                                                                                     

unused devices: <none>

~# mdadm -D /dev/md1

/dev/md1:                                                                                                                                                                           
        Version : 0.90                                                                                                                                                              
  Creation Time : Sat May 14 00:24:24 2016                                                                                                                                          
     Raid Level : raid1                                                                                                                                                             
     Array Size : 4194240 (4.00 GiB 4.29 GB)                                                                                                                                        
  Used Dev Size : 4194240 (4.00 GiB 4.29 GB)                                                                                                                                        
   Raid Devices : 2                                                                                                                                                                 
  Total Devices : 2                                                                                                                                                                 
Preferred Minor : 1                                                                                                                                                                 
    Persistence : Superblock is persistent                                                                                                                                          

    Update Time : Sun Dec  4 00:59:17 2016                                                                                                                                          
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync
       2       0        0        2      removed

       2       8       17        -      faulty   /dev/sdb1

~# mdadm -D /dev/md3

/dev/md3:
        Version : 0.90
  Creation Time : Sat May 14 00:24:24 2016
     Raid Level : raid1
     Array Size : 1458846016 (1391.26 GiB 1493.86 GB)
  Used Dev Size : 1458846016 (1391.26 GiB 1493.86 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 3
    Persistence : Superblock is persistent

    Update Time : Sun Dec  4 00:59:16 2016
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync
       2       0        0        2      removed

       2       8       19        -      faulty   /dev/sdb3

~# cat /etc/fstab

/dev/md1        /               ext3    defaults        1 1
/dev/sda2       none            swap    sw              
/dev/sdb2       none            swap    sw              
/dev/vg00/usr   /usr            ext4    defaults        0 2
/dev/vg00/var   /var            ext4    defaults        0 2
/dev/vg00/home  /home           ext4    defaults        0 2
#/dev/hdd/data  /data           ext4    defaults        0 2
devpts          /dev/pts        devpts  gid=5,mode=620  0 0
none            /proc           proc    defaults        0 0
none            /tmp    tmpfs   defaults        0 0
mschenk74
  • There is no software raid controller. Use the **smart** tools to diagnose your physical disks. – Nils Dec 11 '16 at 22:09
  • By "software raid controller" I meant the software tools doing the work that is normally done by the raid controller in a hardware raid. The smart tools don't show anything suspicious. – mschenk74 Dec 11 '16 at 22:12
  • So if it is not the disks it might be your real controller. Have you checked the firmware for it? – Nils Dec 11 '16 at 22:21
  • I think there's fake RAID. Show RAID controller type and output of `cat /proc/mdstat`. – Mikhail Khirgiy Dec 12 '16 at 05:39

1 Answer


Before anything else, we want to see some information from your syslogs. When the kernel pulls a disc from a RAID array, there will be some information logged. On the most recent occurrence I can find, the critical line is

Nov 21 08:45:49 lory kernel: md/raid1:md1: Disk failure on sdb2, disabling device.

There will very likely be some other information logged immediately before that, giving an indication of a metadevice element in Very Serious Trouble; in my case, they look like this:

Nov 21 08:45:49 lory kernel: end_request: I/O error, dev sdb, sector 1497413335
Nov 21 08:45:49 lory kernel: sd 1:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 21 08:45:49 lory kernel: sd 1:0:0:0: [sdb] CDB: Write(10): 2a 00 59 40 b6 bf 00 00 18 00
Nov 21 08:45:49 lory kernel: end_request: I/O error, dev sdb, sector 1497413311
Nov 21 08:45:49 lory kernel: sd 1:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 21 08:45:49 lory kernel: sd 1:0:0:0: [sdb] CDB: Write(10): 2a 00 59 40 b6 a7 00 00 18 00

So it'd be very useful to see this information at least from the last RAID event, if not from the last two or three (please clarify if HDD replacement has happened between any of these logs). I can't tell you where that will be logged under Debian; I'm afraid you'll need to know that.
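
In case it helps, here is a minimal sketch of where I would start looking on a Debian Jessie box, assuming a default rsyslog setup that writes kernel messages to /var/log/syslog and /var/log/kern.log (the exact filenames and rotation scheme are an assumption on my part):

~# grep -iE 'md/raid1|I/O error|disabling device' /var/log/syslog /var/log/kern.log
~# zgrep -iE 'md/raid1|I/O error|disabling device' /var/log/syslog.*.gz /var/log/kern.log.*.gz
~# journalctl -k | grep -iE 'md/raid1|I/O error|ata[0-9]'

Anything those turn up from around the time of the last failure would be useful to add to your question.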

Secondly, I take your point that you've already replaced both HDDs. I agree that that makes it unlikely that either HDD is at fault, though I'd still run a smartctl -t long /dev/sdX on each of them as a priority (not both at the same time, please!). It does make me wonder about the cabling, though. Next time this happens, you might consider swapping the cables around between the two discs when you power-down for reboot. If the problem swaps sides, you've got a very strong candidate. Or if you can afford it, just replace the bad drive's cables now, with either known-good or brand-new replacements.
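
If it helps, here is a rough sketch of that check, assuming the two drives are /dev/sda and /dev/sdb and that smartmontools is installed (a long self-test on a ~1.5 TB drive can take several hours):

~# smartctl -t long /dev/sda          # start the extended self-test on the first drive
~# smartctl -l selftest /dev/sda      # once it has finished, review the self-test log
~# smartctl -A /dev/sda               # check Reallocated_Sector_Ct, Current_Pending_Sector, UDMA_CRC_Error_Count

Repeat for /dev/sdb afterwards. A rising UDMA_CRC_Error_Count in particular tends to point at cabling rather than the platters.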

And as a final passing note, why are you not mirroring swap as well? Having the persistent storage mirrored but not swap makes it quite likely that you'll get a kernel panic and reboot if a drive fails (and the VM is under load), and RAID device failure time is exactly the time you don't want unattended, unscheduled reboots happening.
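
As a hedged sketch only (the device name /dev/md2, and the use of sda2/sdb2 as the current swap partitions, are assumptions based on your fstab; double-check before running anything):

~# swapoff -a                         # stop using the unmirrored swap partitions
~# mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
~# mkswap /dev/md2
~# swapon /dev/md2

Then replace the two swap lines in /etc/fstab with a single entry:

/dev/md2        none            swap    sw              0 0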

MadHatter
  • There are no events in the syslog since the syslog is also written to the same RAID where the error occurred and therefore it cannot write anything to the logfile. The partitioning is the default partitioning layout of the Server provider that is hosting my server. – mschenk74 Dec 12 '16 at 08:40
  • @mschenk74 that shouldn't make any difference: exactly the same is true for the RAID failure I quoted to you. The metadevice is still writeable, as is the FS thereon; the RAID is merely degraded. If you can't write to that FS **at all**, more is happening than you have indicated in your question. If the FS is still writeable, but nothing is being logged, you may need to look closely at your syslogging. – MadHatter Dec 12 '16 at 09:10
  • It seems that upon setting the drive to faulty this is signalled to LVM which then changes the FS to read-only mode. – mschenk74 Dec 12 '16 at 09:24
  • Again, not in my implementation, which also uses LVM on MDRAID RAID-1. If you're not logging anything, how do you know that's what is happening? I am getting the strong feeling you're either making assumptions (which is bad), or making inferences from data (which is good) which you aren't putting into your question (which is also bad). – MadHatter Dec 12 '16 at 09:26
  • What I can see is: Errors on the command line due to a read-only filesystem. To search for the reason of these errors I issued the commands shown in my question. In the output I can see that the RAID is degraded and so I think this is the reason for the FS being switched to read-only. Such a behaviour was also mentioned in several places around the web when I searched for "LVM on top of MD-RAID" – mschenk74 Dec 12 '16 at 09:56
  • Fair enough, but under apparently identical conditions I *don't* see that behaviour, so we're definitely in the realm of assumptions. Those don't make the best sysadminning. I understand that when the FS goes RO you won't get any local logs, but (assuming you don't get anything useful on the console) perhaps now is the time to investigate having it syslog to a remote syslog server, so that you can get complete logs from around the time of the problem. Without those, I still think you're just guessing. – MadHatter Dec 12 '16 at 14:08