1

My Ubuntu Linux server has an mdadm array (RAID 5) with four 2TB SATA disks that keeps "losing" two disks from time to time. Rebooting and re-assembling the arrays has worked fine up until now.

The hardware is a Dell PowerEdge T20 with an Exsys EX-3400 card that provides four additional SATA ports. Two of the four disks in the RAID array are connected to the Exsys card, and the remaining two disks are connected to the onboard SATA ports (the other onboard SATA ports are in use for other disks). I checked for disk faults using SMART utilities, and all disks seem good.

The disks that are being "lost" from the RAID are the two connected to the add-on SATA controller, so I replaced the add-on card with another one (didn't help, same symptoms). I replaced the SATA cables of the relevant disks (didn't help, same symptoms).

Does anyone have an idea what the source for these issues might be, and what else I could test?

MadHatter
brennmat
  • What are the kernel messages when the disks drop out? – wurtel Mar 18 '15 at 10:52
  • I don't know. How do I check that? Once the disks drop out the computer becomes inaccessible and I have to reboot. – brennmat Mar 19 '15 at 13:18
  • I always keep the root filesystem on RAID1 so that all disks but one can stop working and I can still login to diagnose the problem... you didn't write that you couldn't access the system anymore. Any messages on the console screen? – wurtel Mar 19 '15 at 13:47
  • I have the root file system on /dev/md0, swap on /dev/md1, and /home on /dev/md2. /dev/md0 and /dev/md1 are RAID1, all disks are connected to the on-board SATA ports. No issues with /dev/md0 and /dev/md1. /dev/md2 is where the problems are (RAID5 with 4 disks). Once the /dev/md2 has problems, I can't see any messages because the computer seems locked up / inaccessible. – brennmat Mar 19 '15 at 17:15

2 Answers

1

The problem is not mdadm itself; mdadm only controls the kernel's software RAID functionality.

You don't need to reboot to reassemble an array (except perhaps when it holds your root partition).
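Before reassembling, it can help to see which arrays are actually missing members. The following is a minimal sketch, assuming the usual `/proc/mdstat` layout in which each array's status line ends with something like `[4/2] [UU__]`, where `_` marks a missing member. The function name `degraded_arrays` is made up for this illustration.

```python
import re
from typing import Dict


def degraded_arrays(mdstat_text: str) -> Dict[str, str]:
    """Map each degraded md array name to its member status string, e.g. 'UU__'."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line such as "md2 : active raid5 ...".
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line carries "[total/active] [UU__]"; "_" is a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded[current] = status.group(3)
            current = None
    return degraded
```

On a running system this could be called as `degraded_arrays(open("/proc/mdstat").read())`; an empty result would mean no array reports a missing member.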

Posting the corresponding kernel messages (you can get them with the dmesg command) would help a lot, although I am fairly sure I know the cause of your problem: it is probably the power supply, even though you say the problem occurs only on the add-on controller.
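To pick the relevant lines out of a long kernel log, something like the following sketch may help. The keyword list is an assumption chosen for illustration (libata port names, md devices, EXT4 messages, I/O errors), not a complete set, and the helper names `storage_lines` and `read_kernel_log` are hypothetical.

```python
import re
import subprocess
from typing import List

# Patterns that often appear in storage-related kernel messages.
# This selection is an assumption for illustration; extend it as needed.
STORAGE_PATTERN = re.compile(
    r"\bata\d+|\bmd\d+|EXT4-fs|I/O error|hard resetting link",
    re.IGNORECASE,
)


def storage_lines(kernel_log: str) -> List[str]:
    """Return only the log lines that match one of the storage patterns."""
    return [line for line in kernel_log.splitlines() if STORAGE_PATTERN.search(line)]


def read_kernel_log() -> str:
    """Capture the output of `dmesg`; this may need root on some systems."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

Printing each entry of `storage_lines(read_kernel_log())` right after the disks drop out, if the machine is still responsive, gives a short excerpt that is easier to post than the full log.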

You can easily test whether it is a power problem: swap only the data cables between the add-on SATA controller and the onboard ports, leaving the power cables where they are. Do the problems still happen exclusively on the add-on controller?

If not, there is a power supply problem and you need to find a power supply solution. With "normal" hardware I would buy a better power supply; in your case I suggest asking a new, more hardware-specific question.

If the problems always happen exclusively on the add-on card, in every power/data cable configuration, then the problem is probably the card. Try to get a new one, or a different type.


P.S. You can plug in the power and data cables however you want; Linux software RAID is smart and can recognize the member devices (it does this using auto-generated keys stored in the RAID superblock).

peterh
  • How can I debug the dmesg messages if the computer gets locked up once the mdadm array fails? Is there any way of logging things so I can look at it after rebooting the machine? – brennmat Mar 19 '15 at 17:17
  • Why do you think this could be a power supply issue? – brennmat Mar 19 '15 at 17:18
  • @brennmat It's my own experience: when things are this mysterious, it is mostly power. But I suggested how you can detect that: if it really is power, then the disconnects shouldn't depend on the SATA data port, but on the SATA power port. This is an easy test; do it and share what you find. – peterh Mar 19 '15 at 17:25
  • @brennmat Yes. The simplest way is to have a concurrent, chrooted rescue system in a ramfs. If you don't know what that is, I will post a question for you. – peterh Mar 19 '15 at 17:33
  • OK, I did get something out of dmesg that seems to be related to my mdadm array: [79720.774431] EXT4-fs warning (device md2): __ext4_read_dirblock:901: error reading directory block (ino 97386542, block 5) [79725.782320] EXT4-fs warning: 364231 callbacks suppressed [79725.782323] EXT4-fs warning (device md2): __ext4_read_dirblock:901: error reading directory block (ino 97386542, block 5) Does that help somehow? – brennmat Mar 20 '15 at 13:23
1

I tried all of the above tips. Even switching cables (power, SATA) didn't change the symptoms. The two disks connected to the add-on SATA controller kept being lost from the mdadm array, so I tried yet another SATA controller. No luck. I ended up rearranging the whole machine so I could live without the add-on SATA controller. The mdadm array has been stable for a few days now, and I hope it will remain stable.

brennmat