
As discussed in a previous question, we have 6 OWC Mercury Extreme SATA SSD drives installed in our HP Proliant DL360 G7 server (using a P410i RAID controller). They work great, and are very fast. However, I'm aware that SSD drives unfortunately don't last forever, and the HP ACU utility, not surprisingly, won't monitor the health of any of the drives:

[Screenshot: the HP Array Configuration Utility showing no health information for the drives]

Does anyone know of any Windows (Server 2008R2) software or utilities that will allow monitoring of the health of each individual drive in the array, so that we can proactively pick up on any potential issues?

KenD
  • I'll let `ewwhite` or someone that really knows run with the answer, but I did find this for you. "HP Solid State Drives are equipped with tools that can report the amount of lifetime remaining. Introducing HP SMARTSSD Wear Gauge™. In order to take advantage of SMARTSSD Wear Gauge™, Smart Array Firmware version 5.0 or greater is required and HP Array Configuration Utility (ACU) or HP Diagnostic Utility (ADU) must be running" – TheCleaner Aug 20 '13 at 19:07
  • 1
    @thecleaner Do I *have* to answer!?! – ewwhite Aug 20 '13 at 19:49

2 Answers


You can use smartctl to peek at individual drives behind a cciss RAID controller like so:

smartctl -a -l ssd /dev/sda -d cciss,1

or:

smartctl -a -l ssd /dev/sda -d sat+cciss,1

(you may need to remove `-l ssd` if your version of smartctl is too old)
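For all six drives, the per-position invocations can be generated with a small loop. This is a sketch that assumes the drives sit at positions 0–5 on the controller (verify the actual layout with ACU first); it only prints the commands, so you can review them before piping the output to `sh` as root:

```shell
#!/bin/sh
# Print a smartctl invocation for each drive position behind the cciss
# controller. Positions 0-5 are an assumption for a 6-drive array;
# adjust to match your configuration. Pipe the output to sh to run them.
for pos in 0 1 2 3 4 5; do
  echo "smartctl -a -l ssd /dev/sda -d sat+cciss,$pos"
done
```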

MikeyB
  • What does the "1" after cciss denote? – ewwhite Aug 21 '13 at 07:22
  • Got it. Drive position. The second command string works on the ProLiant with these disks. [**Here's the output**](http://pastebin.com/01kNfw5h). – ewwhite Aug 21 '13 at 07:29
  • @ewwhite you'll see more data with an updated version of smartmontools. Clone the [repo](https://svn.code.sf.net/p/smartmontools/code/trunk/smartmontools), build it, then run it out of the directory. No need to install if you don't want to. – MikeyB Aug 21 '13 at 14:46

Don't bother... Really.

You have an enterprise server with an enterprise RAID controller and hot-swappable drives (with a 5-year warranty), presumably in a RAID 1+0 setup. Do you care why a drive fails, beyond the fact that it failed? I don't. I wouldn't care why a spinning disk died either (S.M.A.R.T. errors, bearing failure, overheating, etc.).

High-end (SAS) HP Solid State drives do provide some additional health information. But if you're using RAID and know where to get a spare, I don't think this information is tremendously helpful. You get temperature readings and an "Estimated Life Remaining" figure.

That is all.

  physicaldrive 1I:1:4
     Port: 1I
     Box: 1
     Bay: 4
     Status: OK
     Drive Type: Unassigned Drive
     Interface Type: Solid State SAS
     Size: 400 GB
     Firmware Revision: HPD9
     Serial Number: 00197356
     Model: HP      MO0400FBRWC     
     Current Temperature (C): 29
     Maximum Temperature (C): 43
     Usage remaining: 99.57%
     Power On Hours: 6418
     Estimated Life Remaining based on workload to date: 61922 days
     SSD Smart Trip Wearout: False
     PHY Count: 2
     PHY Transfer Rate: 6.0Gbps, Unknown
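As a sanity check, the "Estimated Life Remaining" figure above is consistent with a simple linear extrapolation of the other wear fields. This is my assumption about how the number is derived, not documented ACU behavior:

```shell
# Linear extrapolation of the ACU fields above:
# 0.43% of life consumed (100 - 99.57) over 6418 power-on hours,
# so remaining hours = 99.57 / 0.43 * 6418, converted to days.
awk -v rem=99.57 -v hours=6418 'BEGIN {
  used = 100 - rem                      # percent of life consumed so far
  printf "%d days\n", rem / used * hours / 24
}'
# prints "61922 days", matching the ACU output
```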
ewwhite
  • One good reason: it allows you to monitor the life of your SSDs and potentially extend it by moving them to a lower-demand server when they approach their expected wearout. – MikeyB Aug 20 '13 at 20:53
  • These disks are in a RAID array. TRIM commands aren't being passed through the controller. The drives are over-provisioned (15%) and it's just as easy to keep a spare or just leverage the warranty. I don't see people encounter SSD wearout on appropriately-spec'd drives often. – ewwhite Aug 20 '13 at 20:56
  • They probably won't encounter SSD wearout, but it's still very useful to be able to demonstrate this. The SMART wearout counter is likely to hit the threshold at which SMART reports the drive as failed at the same time, more or less, on all the drives. Having a spare won't help you if all your drives are marked bad at the same time, but being able to predict that you'll need to replace them in so many days is useful. – Daniel Lawson Aug 21 '13 at 03:04
  • I have an OWC Electra SSD in a system, and it has a Sandforce controller which reports SSD life left as a pre-fail condition. Once it hits this threshold, SMART will mark it as failing. An Intel 320, on the other hand, reports life as an old-age condition, so SMART won't say the drive is failing once it hits wearout. Given that the OP has non-HP drives in the array, and behaviour at SMART wearout is perhaps uncertain, I'd err on the side of caution and make sure I know when I'm going to hit that wearout point. – Daniel Lawson Aug 21 '13 at 03:11
  • HP controllers factor SMART statistics into the drive health equation. Behavior IS known. I'm the one who recommended the solution :) – ewwhite Aug 21 '13 at 07:16
  • The point is, don't try to outsmart your RAID controller. You wouldn't do this for spinning hard disks... this really shouldn't be different. – ewwhite Aug 21 '13 at 07:26
  • That was my point: If HP controllers factor SMART statistics into drive health, and the SSD the OP is using reports a SMART failure at the point the SMART media wearout passes its critical threshold (which, if it's similar to the OWC SSD I have, it will), then you'll have your RAID controller dropping all those drives in a fairly close timeframe with a SMART failure. You need to know this is going to happen before it happens, and so tracking your wearout is critical. If it was an HP SSD I'd trust the controller, but as it's not I'll trust my own monitoring. – Daniel Lawson Aug 21 '13 at 08:29
  • Most of HP's own SSDs don't provide any additional stats. Only the enterprise SAS SSDs do. The OP even had one of the HP SATA SSDs before these OWC drives. That product doesn't give a wear indicator either. – ewwhite Aug 21 '13 at 12:08