I am in charge of a large number of Windows servers. Recently, many have been reporting hard drive errors with event codes 11 and 55. CHKDSK indicates that the drives are fine most of the time. What other diagnostic tools could I use to more accurately detect hard drive failures? Could these Windows events be false positives? I have already evaluated S.M.A.R.T., and it seems to have significant sensitivity and specificity issues.
What "significant sensitivity and specificity issues"? It's what everyone uses. – Chopper3 Jun 25 '13 at 17:56
Thanks for the quick response! Often, failed drives give no indication at all beforehand, and the indicators that do appear give only a weak suggestion that failure is possible at some point in the near to distant future. Hence, significant sensitivity and specificity issues. See reference #2 here: http://en.wikipedia.org/wiki/S.M.A.R.T – Francis Jun 25 '13 at 18:01
I don't need a link to that page, seriously. It's what everyone uses; whether or not you feel you have to set custom thresholds, it's pretty much your only option. – Chopper3 Jun 25 '13 at 18:08
Remember that `chkdsk` only checks the ***logical*** structure of the disk (the filesystem) unless you're specifying `/r` -- and frankly if you get to the point of using `chkdsk /r` (and it finds bad sectors) you should consider your drive dead. – voretaq7 Jun 25 '13 at 18:31
There is no oracular way to determine whether a drive will fail. SMART attempts to predict it using available information, but sometimes drives just die. – Falcon Momot Jun 25 '13 at 18:32
2 Answers
You detect hard drive failures by monitoring your RAID controller (or software RAID status) for drive failures.
When a drive fails, you replace it as quickly as possible.
Anything else is a proxy for predicting failure (which is useful, though not as critical with RAID).
At the moment there is no better tool than SMART for predicting failure (the very article you reference - which is still the definitive work 6 years later - shows a definite correlation between certain SMART errors and drive mortality).
SMART-based monitoring suffers from a high "False Negative" rate, but positive predictions of failure can be regarded as extremely reliable (and false negatives are accounted for, again, by RAID).
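To make the "trust the positives, don't trust the negatives" point concrete, here is a minimal Python sketch. It assumes smartmontools 7.0 or later is installed (for `smartctl --json`) and that the drives report ATA-style attributes. The watched attribute IDs (5, 197, 198), the "any non-zero raw value" rule, and the helper names `assess_drive` and `read_report` are illustrative assumptions, not something taken from the article referenced above.

```python
import json
import subprocess

# SMART attribute IDs related to reallocated and pending sectors.
# Which attributes to watch is an assumption; tune it for your drives.
WATCHED_ATTRIBUTES = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def assess_drive(report):
    """Return a list of warnings for one parsed `smartctl --json` report.

    An empty list only means "no warning seen", not "drive is healthy":
    SMART misses many failures, so RAID monitoring is still required.
    """
    warnings = []
    status = report.get("smart_status", {})
    if status.get("passed") is False:
        warnings.append("overall SMART health check failed")
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        name = WATCHED_ATTRIBUTES.get(attr.get("id"))
        raw = attr.get("raw", {}).get("value", 0)
        if name and raw > 0:
            warnings.append(f"{name} raw value is {raw}")
    return warnings


def read_report(device):
    """Run smartctl on one device and parse its JSON output."""
    # smartctl uses its exit status as a bitmask, so a non-zero
    # exit code is not treated as a hard error here.
    result = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)
```

Something like `assess_drive(read_report("/dev/sda"))` (the device name is only an example) could run on a schedule, with any non-empty result treated as a reason to replace the drive. An empty result says nothing either way, which is why the RAID status still has to be monitored separately.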

Vote++. This is a basic systems management task. Have hardware. Have failures. Therefore, monitor your hardware status. Simples. – Simon Catlin Jun 25 '13 at 19:29
Thanks! This is quite helpful. Would a SAS controller have similar functionality available? Also, what types of software could monitor the controllers? – Francis Jun 25 '13 at 20:02
SAS (in fact most modern SCSI) controllers have similar functionality to SMART - I believe `smartd` can talk to the drives. RAID controllers usually have their own monitoring agents (or report via IPMI) - you would have to check your specific monitoring solution for integration instructions. (I know on Dell systems if you install OpenManage you get pretty extensive monitoring/reporting capabilities. IBM and HP have equivalent software available.) – voretaq7 Jun 25 '13 at 21:56
Depending on the server manufacturer, there is probably a tool (or set of tools) made to monitor the hardware from a central console. Dell uses OpenManage, which generates alerts for hardware-related problems. HP and IBM have similar tools.
