11

Just a quick question: is there a reason to replace a server's hard drive after x years, before it faults (it eventually will at some point), or should I just leave it until it fails? I have little experience with actual server administration, so I wonder...

Spiros
  • 251
  • 2
  • 7
  • I did not expect to get so many answers, wow :) After reviewing all of them, and taking into consideration that a) the server's hard drives are adequate for its purposes and b) backup is absolutely guaranteed (using RAID + replication slave + daily backup to an external source), I find no reason to suggest a drive change. Thanks all! – Spiros May 22 '10 at 15:42

10 Answers

8

A great reason to change it is if you want to add another task to your list of things to do while increasing the chances of something going wrong.

All joking aside, there really isn't any reason I've heard of to change a drive ahead of time. If you have RAID in place (and decent backups), you already have protection. By leaving the drive alone you're not generating waste in the form of a dead drive to dispose of, you don't have to needlessly scrub sensitive data from it, and you're not spending extra money on new drives. And a proactive swap still wouldn't protect against other things that can go wrong anyway, like a faulty drive controller; that's not a common failure source, but it happens.

On the other hand, a proactive swap might help you discover unrecoverable drive errors that aren't triggering alarms on the RAID unit, as happened to us with RAID 5. We were bitten by this and ended up rebuilding from bare metal using backups (so even in that case, a proper backup will help you recover). A RAID level that takes into account today's larger drive capacities and unrecoverable error rates, such as RAID 6, would have helped us; failing that, backups save the day.
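
To see why larger drives make this worse on RAID 5, a rough back-of-the-envelope calculation helps. Below is a minimal Python sketch; the URE rate, drive size, and array width are illustrative assumptions, not figures from our setup:

```python
# Rough odds of hitting at least one unrecoverable read error (URE)
# while rebuilding a degraded RAID 5 array. All numbers illustrative.
ure_rate = 1e-14        # errors per bit read (common consumer-drive spec)
surviving_drives = 3    # drives read in full during a 4-disk RAID 5 rebuild
drive_bytes = 2e12      # 2 TB per drive
bits_read = surviving_drives * drive_bytes * 8

# Probability that at least one bit read during the rebuild fails:
p_ure = 1 - (1 - ure_rate) ** bits_read
print(f"Chance of a URE during rebuild: {p_ure:.0%}")  # roughly 38%
```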

Most administrators have a decent RAID and backup plan, so there's no real need to generate extra waste by replacing the drives needlessly.

Bart Silverstrim
  • 31,172
  • 9
  • 67
  • 87
6

The only time I might consider this is if I had a bunch of disks from the same batch and others in that batch had started failing.

If I were tight on space, then sure, I'd do it. But for no reason other than the drive getting old? No, because on average the failure rate in the first year is similar to the failure rate in any later year. (Note that the graph breaks the first year into 3-month, 6-month, and 1-year buckets; you have to add those together to get the chance of failure at one year.) And for drives under high utilization, failure is more likely in the first year than in the next three years combined.
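
To make the "add them together" step concrete, here is a minimal Python sketch; the per-interval rates are hypothetical placeholders, not the study's actual figures:

```python
# Combine piecewise failure rates into a cumulative first-year figure.
# Hypothetical chances of failing within each bucket: 0-3 mo, 3-6 mo, 6-12 mo.
interval_rates = [0.03, 0.02, 0.02]

# Exact: multiply the per-interval survival chances, then take the complement.
survival = 1.0
for r in interval_rates:
    survival *= 1 - r
print(f"First-year failure chance: {1 - survival:.1%}")  # ~6.8%

# For small rates, simply summing them is a close approximation:
print(f"Approximation by summing:  {sum(interval_rates):.1%}")  # 7.0%
```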

The only correlation with late drive failure was higher room temperature, and we keep our server rooms cool.

Joe H.
  • 1,917
  • 12
  • 13
5

I'm all for being proactive, but I've never done it and have never heard of anyone doing it. Presumably you have some type of RAID setup and have regularly occurring, valid backups for the system(s) in question.

joeqwerty
  • 109,901
  • 6
  • 81
  • 172
  • 5
    +1, Never considered it. Replacing a disk just in case, and intentionally triggering an array rebuild, doesn't seem like the best way to "exercise" the remaining production disks. It'd be harder to explain to the boss why the system is down if the rebuild failed. – jscott May 21 '10 at 12:30
  • 3
    I replace disks that have SMART errors, but I would consider them failed, even if they still technically work. – Chris S May 21 '10 at 13:29
4

Yes: performance and capacity. If the old hard drive does 70 MB/s sustained reads and 100 IOPS, and the potential replacement does 200 MB/s sustained reads and 175 IOPS with 3 times the capacity, you might be justified in buying new drives and swapping old for new simply for performance/capacity reasons. (Those numbers are totally made up; the point is that newer drives can be significantly faster.)

Now, what do you do with the old drives? You might use them in a test server, add them to a backup-to-disk array, or hold on to them as emergency spares. Or you might just wipe them and send them off for disposal.

Your average server nowadays is I/O bound more than it is processor bound (or at least all of mine are). So if you have a really old server with no CPU time or memory shortages, you likely have room to significantly improve performance by replacing hard drives that are several generations behind what you can easily purchase today.
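
If you want to quantify the gap before spending money, a crude sequential-read measurement is enough for a rough old-vs-new comparison. A minimal Python sketch (the device path is an example; reading a raw device needs root, and on a regular file the OS page cache will inflate the numbers):

```python
import time

def sustained_read_mbps(path, seconds=10, chunk=1024 * 1024):
    """Roughly measure sustained sequential read throughput in MB/s."""
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while time.monotonic() - start < seconds:
            data = f.read(chunk)
            if not data:        # reached end of file/device
                break
            total += len(data)
    elapsed = time.monotonic() - start
    return total / elapsed / 1e6

# Example (hypothetical device path, run as root):
# print(sustained_read_mbps("/dev/sda"))
```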

pplrppl
  • 1,262
  • 2
  • 14
  • 23
3

It depends on the impact of a hard drive fault.

If you don't have RAID
If you don't care about the server's availability (because the service can be stopped, or because it sits behind a high-availability setup) and you have a working backup of the data, I would say OK: let the drive die, then replace it and restore the data once it fails.
If you care about availability, I would say: use RAID ;)

If you have RAID (1, 5, 6, ...)
I would say: why change the hard drive before it faults? RAID (and backup) is there for exactly that. Changing a hard drive just in case it might fail risks breaking something (a RAID reconstruction is always risky).

But that's only my point of view! If you think your drives may be too old, you may want to replace the whole server too.

radius
  • 9,633
  • 25
  • 45
2

Some disks die in 1 hour, others last 2 decades.

If it's not failed or failing (something you can usually establish via S.M.A.R.T. monitoring or performance problems) then the only other reason to throw it out is if it's not large enough or fast enough for your purposes.
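
If you want that S.M.A.R.T. check scripted rather than manual, a thin wrapper around smartmontools' smartctl is enough to poll the overall verdict. A minimal sketch, assuming smartctl is installed and run with root privileges; the device path is an example:

```python
import subprocess

def smart_health(device):
    """Return the drive's overall SMART health verdict via smartctl."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True, check=False,
    )
    # ATA drives print e.g. "SMART overall-health self-assessment test result: PASSED";
    # SCSI drives print e.g. "SMART Health Status: OK".
    for line in result.stdout.splitlines():
        if "overall-health" in line or "SMART Health Status" in line:
            return line.split(":", 1)[1].strip()
    return "unknown"

# Example (hypothetical device path): print(smart_health("/dev/sda"))
```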

Chris Thorpe
  • 9,953
  • 23
  • 33
  • 1
    Just monitor the drive with S.M.A.R.T. and it will usually show the signs of failure before it is too late. – Prof. Moriarty May 21 '10 at 13:12
  • @Prof Google's mass disk study showed SMART was "usually" reliable 44%-72% of the time. http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/disk_failures.pdf – jscott May 21 '10 at 14:10
2

With disks, the question is not if they will fail, but when. They're mechanical devices (unless using SSDs, but they have their own caveats), so they will fail, sooner or later.

Disk vendors tailor their manufacturing processes to be as cheap as possible, because even a single cent saved per disk matters when you're producing and selling thousands of them. Of course, they don't want their disks to fail before the warranty period ends, or they'd be replacing them for free all the time; so they'll happily spend as much as needed to make the disks last as long as the warranty covers them... but not a single cent more.

The end result: most disks tend to fail soon after the warranty period ends. This is not a hard rule, only statistics, and your disk could fail now or last until you no longer need it... but, statistically, lots of disks fail within days or months of their warranty expiring.

Of course, buying new ones when you still don't need them can be costly... but replacing them after the warranty expires and they've failed will be costly anyway.

Now, if you could find a way to make them fail while still warranted (and not losing data in the process, i.e. having good RAID AND backups), well, that would be optimal ;-)

Massimo
  • 70,200
  • 57
  • 200
  • 323
2

I wouldn't replace a working drive any more than I'd replace a working power supply. Both will eventually fail but it makes no sense, either technically or financially, to replace them without good cause. Replace them when they start to show signs of trouble.

In the case of hard drives, the trend is that if a drive is going to fail early, it will more than likely do so in the first year. Drives that have run trouble-free for 6 years can normally be relied on to keep working for at least a few more years. There are plenty of exceptions, but that's the general trend.

John Gardeniers
  • 27,458
  • 12
  • 55
  • 109
  • 1
    You (usually) don't lose data when a power supply fails... – Massimo May 21 '10 at 13:54
  • 1
    @Massimo - True, but on a server you also don't usually lose data when one drive fails. In my opinion, if there's no redundancy it's just a glorified workstation, not a real server. – John Gardeniers May 21 '10 at 21:50
1

Also, keep in mind that most server-class drives have more stringent manufacturing requirements and are typically more reliable than low-cost/budget desktop drives. So, aside from the risks of replacing a 'good' drive on the chance it might fail, doing this across a large array can add up to a large sum of money.

Also, this is why it's a good idea, when using RAID, to have at least one hot spare in the server: the array can begin rebuilding immediately and stay healthy until you purchase replacements on an as-needed basis.

user2626
  • 506
  • 4
  • 5
1

I've done it on "zero-downtime" systems. Really, though, you're just as likely to lose a different drive when the RAID rebuilds... I swapped one out once, then ended up swapping it back in when another drive started throwing errors during the rebuild.

It's a philosophy question really: if you believe in pro-active stress testing (both of the array and of your cardiovascular system) then you should swap your drives. But really, you're never going to know which drive is going to go bad next. It's not at all unlikely that you could lose the newly replaced drive before you lose any of the older, proven drives.

That being said, I'd rather spend my time stress-testing my backup solution, and leave the drives in peace until they actually start throwing errors.

Satanicpuppy
  • 5,946
  • 1
  • 17
  • 18