
In the IT world I just won the Lottery Twice....

Today we had a hard drive fail in a RAID array. A few hours later, another drive failed on a different server... We immediately started checking all the environmental logs and systems. Humidity is 40%, temperature is 75°F, and there's no dust or other particulate flying around. We checked the UPS logs; no spikes reported. About three hours later, another hard drive failed on a third system...

To recap: three HP DL380 G7s, all with sequential serial numbers. The drives are not from the same lot, though I'd bet the array controllers and system boards are. HP will be out in the morning... In the meantime we're hoping this doesn't become a habit... We have had one drive fail in this entire server rack in 2.5 years. Today, three within 12 hours!

What else should we be looking for? Has anyone else had a similar problem?

Any help is greatly appreciated. This incident has consumed our spares... If we have another failure, we'll be looking to HP to swap them.

Update: These are 146 GB 10K RPM SAS drives plus one 300 GB 10K RPM SAS drive, all HP original equipment.

DaffyDuc
  • Did someone [yell at the server rack](http://www.youtube.com/watch?v=tDacjrSCeq4)? – MikeyB Jan 28 '14 at 22:05
  • I had the same problem 5 years ago: http://serverfault.com/q/22448/7709 – oh shit... that was 5 years ago. Damn, I'm feeling old... – Mark Henderson Jan 28 '14 at 22:13
  • You could also have had a heat or other environmental problem in the past that impacted these drives, and they just failed now. – mfinni Jan 28 '14 at 22:22

1 Answer


These things happen... You'd be surprised what I've seen with the same equipment at scale.

You did right by checking your environment for ESD, temperature and power issues.

Since these are ProLiant DL380 G7 units, the array controllers are embedded on the system board, and lot numbers aren't controlled too tightly there. I don't think this is anything beyond coincidence. However, this may be a good time for some firmware updates, as false drive failures are sometimes symptomatic of bad revisions.
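While you wait for HP, a quick SMART sweep of every bay is a cheap way to look for a common-mode problem. This is just a minimal sketch, not HP's official tooling: it assumes smartmontools is installed, that the drives sit behind the Smart Array controller at `/dev/sg0` via smartctl's `cciss` device type (typical for this generation), and that there are 8 bays; adjust both for your box.

```python
#!/usr/bin/env python3
"""Sweep SMART health for drives behind an HP Smart Array controller."""
import subprocess

CONTROLLER_DEV = "/dev/sg0"  # assumption: controller's SCSI generic node
NUM_BAYS = 8                 # assumption: 8 SFF bays in this chassis

for bay in range(NUM_BAYS):
    # smartmontools addresses drives behind the controller as cciss,N
    result = subprocess.run(
        ["smartctl", "-H", "-d", f"cciss,{bay}", CONTROLLER_DEV],
        capture_output=True,
        text=True,
    )
    # smartctl's exit status is a bitmask; nonzero flags command errors
    # or failing/failed SMART attributes.
    status = "OK" if result.returncode == 0 else f"smartctl exit={result.returncode}"
    print(f"bay {bay}: {status}")
    for line in result.stdout.splitlines():
        # SAS drives print "SMART Health Status:"; SATA drives print
        # "SMART overall-health self-assessment test result:".
        if "Health Status" in line or "overall-health" in line:
            print(f"  {line.strip()}")
```

If grown-defect or reallocated-sector counts are climbing across several bays at once, that points at something shared (backplane, power); isolated bad drives look like ordinary attrition.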

Since you have support, let HP deal with the parts/replacement and move on :)

BTW - It would be helpful to detail the drive capacities and type involved (SAS, SATA, Nearline SAS)

ewwhite
  • I added the drive type. These are 146 GB SAS drives and one 300 GB SAS drive. – DaffyDuc Jan 29 '14 at 12:56
  • @DaffyDuc They fail. I have a pile of eighteen 146 GB 10K SAS disks that need to be sent back for warranty repair. – ewwhite Jan 29 '14 at 12:59
  • I know these are mechanical and they are going to fail; however, the probability of having no failures over the course of 3 years and then 3 in one day is pretty incredible. I can believe it's just dumb luck, but I want to ensure we have covered all of our bases before handing in an incident report that says a black swan visited and brought his purple brother... :) (See the rough probability sketch below.) Thank you for posting though... I hope by the end of the week I don't have the same stack... lol – DaffyDuc Jan 29 '14 at 13:18
  • I think [@dennis-kaarsemaker](http://serverfault.com/users/144990/dennis-kaarsemaker) has [a really helpful image](http://chat.stackexchange.com/transcript/message/8815130#8815130) of drive failure rates for large installs. – jscott Jan 29 '14 at 13:22
  • Short of firmware on the controllers (some revisions reported false failures), this is what it is. You may want to take this time to update system firmware anyway. – ewwhite Jan 29 '14 at 13:23
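For what it's worth, the "dumb luck" question in the comments can be put in rough numbers. A minimal back-of-the-envelope sketch, assuming failures were independent and the rack's historical rate (one failure in 2.5 years) held:

```python
import math

# Baseline from the question: one drive failure in the rack over 2.5 years.
rate_per_hour = 1 / (2.5 * 365 * 24)   # ~4.6e-5 failures per hour
lam = rate_per_hour * 12               # expected failures in a 12-hour window

# Poisson probability of 3 or more failures in that window
p_ge_3 = 1 - sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(3))
print(f"lambda = {lam:.2e}, P(>=3 failures in 12h) = {p_ge_3:.2e}")
```

That works out to roughly 3 × 10⁻¹¹, which is exactly why the independence assumption is the suspect part: real drive failures cluster around shared stresses (a past thermal event as mfinni suggested, a firmware revision, a power hiccup), so runs like this turn up far more often than the naive model predicts.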