
What happens to data on a Local SSD if the entire Google data center were to suffer a cataclysmic loss of power? When the Compute Engine instance eventually comes back online, will it still have the data on the Local SSD? It seems like planned downtime is handled just fine:

No planned downtime: Local SSD data will not be lost when Google does datacenter maintenance, even without replication or redundancy. We will use our live migration technology to move your VMs along with their local SSD to a new machine in advance of any planned maintenance, so your applications are not disrupted and your data is not lost.

But I'm concerned about unplanned downtime. Disk failure is an ever-present risk, but combining Local SSD with replication protects against that. However, I'm trying to guard against correlated failure, where e.g. the whole region goes dark. Then the in-memory replicated data is lost, but is the data fsynced to the Local SSD likely to survive when the instances come back up? If it is lost, then fsyncing data to Local SSD buys no more safety than RAM, e.g. for running a local database instance.
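
To be concrete, by "fsyncing to the Local SSD" I mean something like the following minimal Python sketch; the mount point below is just an assumption for illustration, not a path Compute Engine guarantees:

    import os

    # Hypothetical mount point for the Local SSD; adjust to wherever the
    # device is actually formatted and mounted on the instance.
    LOCAL_SSD_PATH = "/mnt/disks/local-ssd/journal.log"

    def durable_append(record: bytes) -> None:
        # Append the record and fsync so the call only returns after the
        # kernel reports the data has been flushed to the device; that is
        # the durability guarantee I'm asking about.
        fd = os.open(LOCAL_SSD_PATH, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, record)
            os.fsync(fd)
        finally:
            os.close(fd)

    durable_append(b"committed transaction\n")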


1 Answer


As an aside, please note that Google data centers have redundant power supplies and backup generators to guard against correlated power failures:

Powering our data centers

To keep things running 24/7 and ensure uninterrupted services, Google’s data centers feature redundant power systems and environmental controls. Every critical component has a primary and alternate power source, each with equal power. Diesel engine backup generators can provide enough emergency electrical power to run each data center at full capacity. Cooling systems maintain a constant operating temperature for servers and other hardware, reducing the risk of service outages. Fire detection and suppression equipment helps prevent damage to hardware. Heat, fire, and smoke detectors trigger audible and visible alarms in the affected zone, at security operations consoles, and at remote monitoring desks.

Back to your questions. You asked:

Then the in-memory replicated data is lost, but is the data fsynced to the Local SSD likely to survive when the instances come back up?

Per the local SSD documentation (emphasis in the original):

[...] local SSD storage is not automatically replicated and all data can be lost in the event of an instance reset, host error, or user configuration error that makes the disk unreachable. Users must take extra precautions to back up their data.

If all of the above protections fail, a power outage is equivalent to an instance reset, which may leave Local SSD volumes inaccessible: the VM is very likely to restart on a different physical machine, and if it does, the data is effectively "lost", since it is both inaccessible and will be wiped.

Thus, you should consider Local SSD data to be just as transient as RAM.


You also asked:

However, I'm trying to guard against correlated failure, where e.g. the whole region goes dark.

If you want to protect against a zone outage, replicate across multiple zones in a region. If you want to protect against an entire region outage, replicate to other regions. If you want to protect against correlated region failures, replicate to even more regions.
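
As a rough sketch of what application-level synchronous replication across zones could look like (the replica hostnames and the /replicate endpoint are hypothetical placeholders, not a Google Cloud API), assuming one replica VM per additional zone:

    import json
    import urllib.request

    # Hypothetical replica endpoints, one per additional zone; these names
    # and the /replicate path are illustrative only.
    REPLICAS = [
        "http://replica-us-central1-b:8080/replicate",
        "http://replica-us-central1-c:8080/replicate",
    ]

    def replicate_sync(record: dict) -> None:
        """Acknowledge a write only after every replica confirms it.

        Each extra zone (or region) adds a network round trip before the
        write can be acknowledged; that latency is the cost of surviving
        a zone or region outage.
        """
        payload = json.dumps(record).encode()
        for url in REPLICAS:
            req = urllib.request.Request(
                url, data=payload, headers={"Content-Type": "application/json"}
            )
            with urllib.request.urlopen(req, timeout=2) as resp:
                if resp.status != 200:
                    raise RuntimeError("replication to {} failed: {}".format(url, resp.status))

    replicate_sync({"txn": 42, "op": "commit"})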

You can also store snapshots of your data in Google Cloud Storage, which provides a high level of durability:

Google Cloud Storage is designed for 99.999999999% durability; multiple copies, multiple locations with checksums and cross region striping of data.
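
For example, a periodic backup of the Local SSD contents into Cloud Storage could look like this sketch using the google-cloud-storage Python client; the bucket name and archive path are assumptions for illustration:

    from google.cloud import storage  # pip install google-cloud-storage

    # Illustrative names; substitute your own bucket and snapshot archive.
    BUCKET_NAME = "my-durable-backups"
    LOCAL_SSD_ARCHIVE = "/mnt/disks/local-ssd/snapshot.tar.gz"

    def upload_backup() -> None:
        # Copy a snapshot of the Local SSD data into Cloud Storage, which is
        # designed for 11 nines of durability, unlike the Local SSD itself.
        client = storage.Client()
        bucket = client.bucket(BUCKET_NAME)
        blob = bucket.blob("local-ssd-backups/snapshot.tar.gz")
        blob.upload_from_filename(LOCAL_SSD_ARCHIVE)

    upload_backup()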

Misha Brukman
  • Thanks Misha, that's basically the same problem as AWS: there's no protection against an entire region going down without *synchronously* replicating to another region, which introduces too much latency. You can use persistent disks to get around that, but they're slower, hurting both throughput and latency. Dedicated servers have a massive advantage here: they're usually cheaper, more powerful, and they don't go anywhere in the event of a power failure. – Eloff Jan 08 '16 at 14:05
  • @Eloff — nothing is perfect, there are always trade-offs. Dedicated servers will reboot in-place, but in the event of a hardware failure, you have no immediate replacement for that dedicated server except for provisioning more physical dedicated servers. And you have to pay for the physical servers up-front and own them, whereas you pay for as much as you use with cloud VMs, on a per-minute or per-hour granularity, with ability to quickly grow or shrink your deployment footprint. – Misha Brukman Jan 08 '16 at 16:42
  • Yes, you're right about the tradeoffs. However, for the price I can get about 4x as many dedicated servers as cloud servers, so I could literally have 4x spare capacity just sitting idle at no extra cost. That should cover hardware failures and traffic spikes. You'd have to have massive variability of load, e.g. Netflix, for the public cloud to make sense. – Eloff Jan 09 '16 at 13:07
  • @Eloff — we're getting off-topic from the original question, and I'm happy to discuss it further with you, but just to be clear: as you can see from my profile, I'm not entirely unbiased. :-) If you want to continue the conversation, please feel free to reach out to me via LinkedIn. That said, please consider the [TCO](https://en.wikipedia.org/wiki/Total_cost_of_ownership) rather than just baseline hardware cost, e.g., does the dedicated server include cost of space, power, and cooling? What about cost to replace a broken disk or machine (hardware + labor)? What about the data on the disk? – Misha Brukman Jan 10 '16 at 05:00