Reading from the restored volume should be sufficient.
When you create a volume from an existing snapshot, it loads lazily in the background so that you can begin using it right away. If you access a piece of data that hasn't been loaded yet, the volume immediately downloads the requested data from Amazon S3, and then continues loading the rest of the volume's data in the background.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSSnapshots.html
Anecdotally, it seems like sequential "forced reading" using dd performs better than the more random reads that would result from reading through the filesystem, but you can of course do both at the same time -- go ahead and mount it and start doing whatever you need to do, but also read and discard the entire block device with dd.
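For illustration, a forced sequential read of the whole device might look something like this; the device name /dev/xvdf is just a placeholder (on Nitro instance types the volume shows up as something like /dev/nvme1n1):

```
# Read the entire block device sequentially and discard the data, forcing
# EBS to pull every block down from S3. Point this at the whole device,
# not at a partition or the mounted filesystem.
sudo dd if=/dev/xvdf of=/dev/null bs=1M status=progress
```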
This apparent difference would make sense, particularly if the EBS snapshot infrastructure doesn't actually store the snapshot data in "block"-sized (4,096-byte) chunks. It seems like that would be a pretty inefficient design, requiring hundreds of operations for every megabyte.
It might further improve restoration if you did multiple sequential reads starting at different offsets. Untested, but GNU dd can "skip" blocks and begin reading somewhere other than the beginning, as sketched below.
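A rough sketch of that idea -- again untested, and again with /dev/xvdf as a placeholder -- splitting a hypothetical 100 GiB volume across four concurrent readers:

```
# Four parallel sequential readers, each covering a 25 GiB slice of a
# 100 GiB volume. With bs=1M, skip= and count= are in 1 MiB units,
# so 25600 blocks = 25 GiB per reader.
for i in 0 1 2 3; do
  sudo dd if=/dev/xvdf of=/dev/null bs=1M skip=$((i * 25600)) count=25600 &
done
wait
```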
But you definitely don't need to create "fresh" volumes. Once the blocks have been loaded by a read, they're "in" EBS and are no longer fetched from the snapshot's backing store.
If there's a stack of 40 snapshots backing the volume then it's presumably having some difficulty quickly locating the block in the most recent snapshot it appears in and fetching it.
It shouldn't really matter how many snapshots were backing it. The data isn't stored "in" the snapshots. Each snapshot contains a complete record of what I'll casually call "pointers" to all of the data blocks comprising it (not just the changed ones), and presumably where they are stored in the backing store (S3) used by the snapshot infrastructure.
If you have snapshots A, B, and C taken in order from the same volume, and then you delete snapshot B, all of the blocks that changed from A to B but not from B to C are still available for restoring snapshot C, but they are not literally moved from B to C when you delete snapshot B.
When you delete a snapshot, EBS uses reference counting to purge the backing store of blocks that are no longer needed. Blocks not referenced by any snapshot are handled in the background by a multi-step process that first flags them as unneeded, which stops billing you for them, and then actually deletes them a few days later, once it has been confirmed that they are genuinely at a reference count of 0. Source.
Because of this, the number of snapshots that originally contributed blocks to your restored volume should have no impact on restore performance.
Additional, possibly useful info: the following doesn't change the accuracy of the answer above, but it might be of value in certain situations.
In late 2019, EBS announced a new feature called Fast Snapshot Restore that allows volumes created in designated availability zones from designated snapshots to be instantly hot with no warmup required.
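Enabling it is a single API call; with the AWS CLI it looks roughly like this (the availability zone and snapshot ID are placeholders):

```
# Enable Fast Snapshot Restore for one snapshot in one availability zone.
# The hourly charge for the feature starts as soon as it's enabled.
aws ec2 enable-fast-snapshot-restores \
    --availability-zones us-east-1a \
    --source-snapshot-ids snap-0123456789abcdef0
```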
Using a credit bucket and based on the size of the designated snapshot (that is, the size of the disk volume it was taken from) -- not the size of the target volume (which can be larger than that of the snapshot) -- this feature allows you to create roughly 1024 GiB ÷ snapshot size volumes per hour, so a 128 GiB snapshot could create 8 pre-warmed volumes per hour. As snapshots get smaller, the number of volumes you can create per hour, per snapshot, per availability zone is capped at 10.
The service is also startlingly expensive -- $0.75 per hour, per snapshot, per availability zone (!?) -- however, this may not be something you would need to leave running continuously, and in that light it seems to have some potential value.
When you activate the feature, the service API can tell you when it's actually ready to use; the stated timetable for "optimizing a snapshot" is 60 minutes per TiB. Reading between the lines, "optimizing" appears to mean building and warming up a hidden primary volume inside EBS from the snapshot, which the service then clones to create additional EBS volumes. The feature appears to be usable only after this stage is complete; volumes created from the same snapshot before that point are just ordinary volumes.
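To find out when that stage has finished, you can poll the feature's state with the same placeholder snapshot ID; it progresses from "enabling" through "optimizing" to "enabled":

```
# Show the Fast Snapshot Restore state per availability zone; the feature
# is only fully warmed once the state reaches "enabled".
aws ec2 describe-fast-snapshot-restores \
    --filters Name=snapshot-id,Values=snap-0123456789abcdef0 \
    --query 'FastSnapshotRestores[*].[AvailabilityZone,State]' \
    --output table
```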
As long as you have time to wait for the "optimizing" stage, and processes in place to terminate the fast restore behavior when you no longer need it (to avoid a very large unexpected billing charge), this does seem to have applicability in limited use cases.
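Turning it back off when you're done -- the part worth automating, given the pricing -- is the mirror image of enabling it:

```
# Disable Fast Snapshot Restore to stop the per-hour, per-AZ charges.
# Volumes that were already created while it was enabled are not affected.
aws ec2 disable-fast-snapshot-restores \
    --availability-zones us-east-1a \
    --source-snapshot-ids snap-0123456789abcdef0
```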