I'm opening this topic to find the cause of our SSD failure rate of over 50%. We work on Macs (13" MacBook Pro and 13" MacBook Air), with installations of both Mac OS X Mountain Lion and Ubuntu. 90% of the users develop websites; the others do basic office work.

So far I have tried the following brands:

  • OWC Mercury 6G: destroyed 5
  • OWC Aura Pro: destroyed 3
  • Intel 500: destroyed 2
  • Standard Apple SSD (sm256c): destroyed 2

All of this in less than one year!

The symptoms of a failing SSD are always more or less the same:

  • Extremely low read/write speeds (10MB/sec sequential)
  • Amnesia (chmod a directory, 5 minutes later the permissions are restored)
  • Lost files (/etc/hosts was gone)
  • Random crashes/hangs of software

From this I conclude that the brand doesn't matter, and enabling TRIM doesn't make a difference either. I also recommended leaving at least 30GB of disk space free, which made no difference.

What else could be the problem? We do run MySQL and Postgres databases on the machines, but I can't believe we're the only ones having this problem. Is there some way we can track down what is causing our SSDs to fail?

Edit: So I think we all agree this isn't normal behaviour. Do you know of any OS X applications that can monitor disk writes per application/process? I found some for Linux, but the majority of my users with this problem work on Mac OS X 10.8. Htop, for example, doesn't show me disk reads/writes on OS X, even as root.

ewwhite
user196611
  • Can you clarify if the failures occur primarily under Ubuntu? – ewwhite Oct 29 '13 at 13:19
  • The failures occur primarily under Mac OSX. Ubuntu has destroyed only 1 SSD up to now. But the symptoms were exactly the same. So I guess something must be related – user196611 Oct 29 '13 at 13:22
  • I don't know the answer, but I will say that I've seen a few OWC SSDs die in Snow Leopard and Mountain Lion, and the cause for us wasn't the drive itself but the crappy ribbon cable it ships with, which has to be bent weirdly in the Airs. It causes the crash and then only boots to a white screen (no Apple logo). – TheCleaner Oct 29 '13 at 13:33
  • If you move one of these failing drives to another non-Mac computer do they behave? Have you tried doing a smart secure erase or TRIM for the entire drive? http://superuser.com/questions/308251/how-to-trim-discard-a-whole-ssd-partition-on-linux – longneck Oct 29 '13 at 13:51
  • Moving them to another Mac or accessing them through a USB interface shows the same problems. Secure erase fails most of the time, and enabling TRIM doesn't make a difference in lifetime – user196611 Oct 29 '13 at 13:59

1 Answer

I have not seen SSD failure rates this high on MacBooks in my organizations...

Things to consider:

  • Typically the SSD failures will be a result of wear out from write activity. Check to see if there are any processes/programs common to the laptops that may cause more wear than normal.
  • Use a tool like SMARTReporter to track S.M.A.R.T. diagnostics and indicators on the drives.
  • Ubuntu probably isn't the best OS to run natively on a MacBook. What do you do about firmware and platform updates?
  • OWC stands behind its products and offers extremely long warranties on its SSDs. You should be in touch with their support to understand why the devices are failing. They may be able to give you more information than we can.
  • Please don't fill these drives up. I try to keep things below 70% utilization. Spec larger drives if you must.
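
For the S.M.A.R.T. point above, a minimal Python sketch of pulling wear-related attributes out of the table that `smartctl -A` prints. The sample text is hypothetical (made-up values, not taken from any drive in this thread), and the helper names `parse_attributes` and `wear_remaining` are invented for illustration. It also assumes the drive exposes attribute 233 (0xE9) with a normalized value that starts at 100 and counts down, as Intel documents for its Media Wearout Indicator; other vendors may use different IDs or not report wear at all.

```python
import re

# Hypothetical sample in the column layout printed by `smartctl -A`.
# The values are made up for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1234
233 Media_Wearout_Indicator 0x0032   097   097   000    Old_age   Always       -       0
"""

# One attribute row: id, name, flag, value, worst, thresh, type,
# updated, when_failed, raw value (rest of the line).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(.+?)\s*$"
)


def parse_attributes(text):
    """Return {attribute_id: {...}} for every row that looks like an attribute."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or unrelated line
        attr_id, name, value, worst, thresh, raw = match.groups()
        attrs[int(attr_id)] = {
            "name": name,
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "raw": raw,
        }
    return attrs


def wear_remaining(attrs, attr_id=0xE9):
    """Normalized value of the wear attribute, or None if the drive lacks it."""
    entry = attrs.get(attr_id)
    return None if entry is None else entry["value"]


if __name__ == "__main__":
    attributes = parse_attributes(SAMPLE)
    print("wear indicator (normalized):", wear_remaining(attributes))
```

In practice the text would come from running `smartctl -A` against the real device (the device path depends on the machine) and logging the result periodically, so a falling wear value can be compared across laptops.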

Edit:

For monitoring system and I/O activity, Apple's Activity Monitor or something like gkrellm will help record read/write activity. It should be clear which processes become resource hogs.

ewwhite
  • I've never seen an app that can read correct SMART data from an SSD. Tools like smartctl or apps like SMARTReporter or SMART Utility don't report anything wrong with the failing disks. Also, OWC doesn't believe me anymore. They can't RMA an SSD which isn't 100% completely dead. I've already applied your other tips without success. – user196611 Oct 29 '13 at 13:39
  • Do these OWC SSDs report wear percentage used/remaining in their SMART data? If not, try installing an Intel SSD and monitoring attribute E9. You might find out whether or not your database installations are thrashing your SSDs with writes. If that's the problem, you need to switch to SLC or eMLC drives. http://www.intel.com/support/ssdc/hpssd/sb/CS-034531.htm – Skyhawk Oct 29 '13 at 13:50
  • This is the smart data I get from an Intel SSDSC2BW240A4: https://gist.github.com/thomasmeeus/7215349. And here's the data from an Apple SSD: https://gist.github.com/thomasmeeus/7215443. It clearly doesn't show all the parameters like we would get with an HD – user196611 Oct 29 '13 at 14:11
  • SMART on SSDs doesn't show the same parameters as a spinning drive, because the ones it doesn't show aren't relevant to an SSD (head fly height, for example, is meaningless when you don't have heads). However, your SMART app doesn't have the latest drivedb, which is why you get all those 'Unknown Attributes'. They will be useful to you, if you can work out what they are. I don't know what app you're using, so I can't comment on how to update them, but if you install the latest version of smartmontools it should be fine. – Daniel Lawson Oct 30 '13 at 20:51
  • Those results come from the latest smartmontools available in Homebrew (6.2 stable). I just updated its database, but it still shows the same "unknown attributes". I guess SMART indeed doesn't work well with SSDs. – user196611 Oct 31 '13 at 10:52
  • Go back to the manufacturer. OWC should be able to answer any questions about what's happening to your SSDs, provided they've had a chance to analyze. – ewwhite Oct 31 '13 at 10:54
  • @ewwhite like I said, this happens to a lot of SSDs, not only OWC. OWC did exchange some SSDs for us, but they can't keep doing it. It seems that some kind of software is ruining our disks, so I want to find out what software. – user196611 Oct 31 '13 at 11:01
  • That's an issue in your environment. See if there are any programs or processes unique to the systems that could cause increased writes. Do you have enough RAM? Are these systems swapping to disk often? What is the nature of the environment and what software is in common use? Are you imaging systems or installing from scratch? – ewwhite Oct 31 '13 at 11:04
  • Installations are done from scratch with Chef. Each MacBook has 8GB of RAM, which is enough for 90% of the users. Swapping only occurs for the other 10%. Our development stack is PHP/Java, with MySQL/Postgres and some other tools. I'm beginning to suspect Postgres, but I can't find a way to monitor disk writes per application/process in OS X. – user196611 Oct 31 '13 at 11:08
  • Marginally relevant: http://techreport.com/review/25320/the-ssd-endurance-experiment-22tb-update. It takes a LOT of writes to kill off an SSD, even TLC-based ones. I rather doubt it's how much it's writing alone, personally. – Journeyman Geek Oct 31 '13 at 11:27
  • @ewwhite OS X's Activity Monitor only shows disk usage globally, not per application. AFAIK, gkrellm doesn't do that either. Activity Monitor does show a few terabytes written to the SSD per month, but I still can't narrow down which application is the cause – user196611 Oct 31 '13 at 11:34
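
To put the "few terabytes per month" figure from the comments into perspective, a rough back-of-the-envelope sketch in Python. The endurance figure is purely an assumption chosen for illustration; it is not a rating for any of the drives mentioned in this thread, and the function name `months_to_reach` is hypothetical. The real number would have to come from the vendor's datasheet or from observed wear.

```python
def months_to_reach(endurance_tb, written_tb_per_month):
    """Months of writing at a constant rate until endurance_tb has been written."""
    if written_tb_per_month <= 0:
        raise ValueError("write rate must be positive")
    return endurance_tb / written_tb_per_month


# ASSUMPTION: illustrative endurance only, not a spec for any drive named here.
ASSUMED_ENDURANCE_TB = 100.0

if __name__ == "__main__":
    # "A few terabytes per month": try a small range of host write rates.
    for rate_tb in (1.0, 3.0, 5.0):
        months = months_to_reach(ASSUMED_ENDURANCE_TB, rate_tb)
        print(f"{rate_tb:.0f} TB/month -> about {months:.0f} months")
```

The estimate only counts writes issued by the host. Write amplification inside the drive, which tends to grow on nearly full disks and with small random writes such as database workloads, would shorten the result, so this is a best-case bound and not a prediction.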