How to theoretically prevent hardware failures in a collection of 10,000 servers?

Question

From a book about warehouse-scale computers:

Although it might be theoretically possible to prevent hardware failures in a collection of 10,000 servers, it would surely be extremely expensive.

How is it theoretically possible? Hardware failures are something like hard drives failing, right? So how could you prevent an accident like that?

@Anon So basically you're not preventing hardware failure, you're just implementing a RAID-type system where you have backup servers? — , Jan 24 '11 at 02:30
@funk-shun: Not necessarily backup servers, just backup hardware in those servers (redundant hard drives, power supplies, etc.). With sufficient expenditure, the failure chance of each individual server can be brought arbitrarily close to zero. — Anon., Jan 24 '11 at 02:31
I think what they mean is preventing _data loss or downtime_ from hardware failure. I can't imagine a way to prevent hardware failure, even assuming infinite money to dump at the problem. — , Jan 24 '11 at 02:31
The only way to prevent hardware failure is to never turn the hardware on. Beyond that scenario, a start would be: triple backup/failover of servers (and that's simplistic), rotating the individual servers with their backups, rigorously backing up the servers, systematically maintaining and testing the servers rotated offline, and analyzing server usage, especially with respect to peak load. — , Jan 24 '11 at 02:46
@Kirt Undercoffer: Technically, hardware can fail when powered off. — Kedare, Jan 24 '11 at 10:51

score 2 · Answer 1 · answered Jan 24 '11 at 03:03

It's not theoretically possible. All hardware will fail eventually. This is why in any system in which the value of the data bears safeguarding you must implement appropriate backup and recovery strategies.
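
As a rough illustration of why failures become close to certain at this scale, here is a minimal sketch. It assumes every server fails independently with the same probability over some fixed period, which real fleets only approximate. The function name `prob_any_failure` and the per-server probabilities are hypothetical, picked only for illustration.

```python
def prob_any_failure(num_servers: int, p_fail: float) -> float:
    """Probability that at least one of num_servers fails,
    assuming independent failures with identical probability p_fail."""
    return 1.0 - (1.0 - p_fail) ** num_servers


# Hypothetical per-server failure probabilities over some fixed period.
for p in (1e-2, 1e-4, 1e-6):
    print(f"p = {p:g}: P(at least one failure among 10,000) = "
          f"{prob_any_failure(10_000, p):.4f}")
```

Under these assumptions, even a per-server failure probability of 1 in 10,000 gives roughly a 63% chance that at least one of the 10,000 servers fails in that period.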

par

score 0 · Answer 2 · answered Jan 24 '11 at 02:37

@funk-shun: Not necessarily backup servers, just backup hardware in those servers (redundant hard drives, power supplies, etc.). With sufficient expenditure, the failure chance of each individual server can be brought arbitrarily close to zero. – Anon.
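
The redundancy argument can be sketched numerically. This assumes components fail independently and that a server goes down only when every redundant copy of some component has failed. The function name, the component list, and the probabilities are all hypothetical.

```python
def server_failure_prob(component_probs: dict[str, float], copies: int) -> float:
    """Probability that a server fails when each component has `copies`
    independent replicas and the server fails if all replicas of
    any one component fail."""
    p_ok = 1.0
    for p in component_probs.values():
        # A component survives unless all of its replicas fail.
        p_ok *= 1.0 - p ** copies
    return 1.0 - p_ok


# Hypothetical per-component failure probabilities over some fixed period.
components = {"disk": 0.03, "power_supply": 0.01, "fan": 0.02}

for k in (1, 2, 3, 4):
    print(f"{k} copies per component: P(server fails) = "
          f"{server_failure_prob(components, k):.2e}")
```

In this model each extra replica multiplies the hardware cost, while the failure probability shrinks geometrically without ever reaching zero, which fits both "arbitrarily close to zero" and "extremely expensive".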
