How to detect hard disk failure in a custom server

Question

We have a custom server with the following details:

Motherboard: Supermicro X8DTL-3
RAID controller: HP Smart Array P400 (512G BBWC)
HDD backplane: Supermicro SAS825TQ
Disks: 3 Seagate Barracuda 1 TB HDDs (RAID 5)
Host: VMware ESXi 6.0
VMs: CentOS 6.x and 7.x

Our server's load has increased abnormally. When I checked, I found RAID errors 1792 and 1779 during the boot process. After re-enabling the RAID, we checked the hard disks, and the RAID management software showed them as OK.

Then we tested the hard disks with SeaTools for Windows (SMART, short DST, and long DST tests). Two of the hard disks have serious problems and failed the tests.

In a typical HP server like the DL380 G7, the HDD LEDs change color from green to orange to indicate problems, but in a custom server like ours this feature is not available.

My question is: how can we detect hard disk problems before losing data?

score 3 · Accepted Answer · answered Aug 02 '15 at 05:52

There should be tools available to query your RAID controller and determine the SMART status of the drives in the array. Not knowing the particular device you've got, I don't have any suggestions as to what to use.

Once you know what to use (and how to use it), you'll need to automate the monitoring, so it will proactively notify you when there is a problem (because you'll forget to check it manually -- I guarantee it). If you're lucky, the RAID controller's management tool may have such functionality built-in, but more likely you'll need to write some sort of script to run the management tool, and if it reports problems, to send you an e-mail.
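As a rough illustration of the "run the tool, e-mail on problems" idea, here is a minimal Python sketch. Everything specific in it is an assumption rather than something from the question: the status command `raid-status-tool --show-drives` is a hypothetical placeholder for whatever CLI your controller actually provides, the warning keywords are guesses at what such a tool might print, and the addresses and SMTP host are examples only.

```python
import shlex
import smtplib
import subprocess
from email.message import EmailMessage

# All values below are hypothetical placeholders; adjust them to your setup.
STATUS_COMMAND = "raid-status-tool --show-drives"  # not a real tool name
BAD_KEYWORDS = ("failed", "predictive failure", "degraded", "rebuilding")
MAIL_FROM = "raid-monitor@example.com"
MAIL_TO = "admin@example.com"
SMTP_HOST = "localhost"


def find_problems(output, keywords=BAD_KEYWORDS):
    """Return the lines of tool output that mention any warning keyword."""
    problems = []
    for line in output.splitlines():
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            problems.append(line.strip())
    return problems


def build_alert(problems):
    """Build an e-mail message listing the flagged lines."""
    msg = EmailMessage()
    msg["Subject"] = f"RAID/disk warning: {len(problems)} line(s) flagged"
    msg["From"] = MAIL_FROM
    msg["To"] = MAIL_TO
    msg.set_content("\n".join(problems))
    return msg


def main():
    # Run the management tool and capture everything it prints.
    result = subprocess.run(
        shlex.split(STATUS_COMMAND), capture_output=True, text=True
    )
    problems = find_problems(result.stdout + result.stderr)
    # A non-zero exit code is treated as a problem in its own right.
    if result.returncode != 0:
        problems.append(f"status command exited with code {result.returncode}")
    # Stay silent when everything looks healthy; mail only on trouble.
    if problems:
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(build_alert(problems))


if __name__ == "__main__":
    main()
```

Run something like this from cron (or any scheduler) every few minutes. It is also worth having the monitoring host alert you if the check itself stops running, since a silent script looks the same as a healthy array.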

Host OS is ESX so not many tool options without the hardware support from ESX. — albal, Aug 04 '15 at 13:03

score 0 · Answer 2 · answered Aug 02 '15 at 07:48

There are lots of tools out there that will help you monitor the status of your HDDs and predict when they will fail, or tell you whether they have already failed, so that you can replace them as quickly as possible.
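One common approach to failure prediction is to watch a few SMART counters that tend to rise before a disk dies. The sketch below assumes you can get the attribute table printed by smartmontools' `smartctl -A` for each drive, which may not be possible from behind a RAID controller or from inside a VM. The choice of watched attributes and the "anything above zero is suspicious" rule are assumptions for illustration, not a vendor-defined threshold.

```python
# Attributes whose raw value is normally 0 on a healthy disk
# (an assumed watch list; extend it as you see fit).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
}


def flag_attributes(smartctl_output, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header line and any surrounding text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name not in watched:
            continue
        try:
            value = int(raw)
        except ValueError:
            continue  # raw value is not a plain integer; ignore it
        if value > 0:
            flagged[name] = value
    return flagged


if __name__ == "__main__":
    sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22000
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
    print(flag_attributes(sample))
```

A non-empty result is a reason to look closer and plan a replacement, not proof that the disk is about to fail. Likewise, an empty result does not guarantee the disk is healthy.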

Since you don't mention which OS you're running on the server, I can't offer more specific suggestions.

score 0 · Answer 3 · answered Aug 04 '15 at 13:03

The P400 is not supported in ESXi 6.0, so you won't get health status from the controller.

albal
