We have an automated task that runs 3x daily to download and restore a full backup of a large DB from an EU S3 bucket to our on-prem server in the US. This was set up when the DB itself was small and transfer time/costs were minimal. Due to factors outside our control, the DB is now 70+ GB. Full backups are taken every 3 days, with diffs taken every 8 hours. Each run of the 3x daily task pulls the .bak files of the most recent full and diff. At download speeds of about 120 Mbps during the day, it can take several hours to pull this from S3, and at 3x daily, 365 days a year, at $0.09/GB transfer out of S3, the transfer costs alone are non-trivial (back of the envelope: 70 GB x 3 x 365 x $0.09 is roughly $6,900/year for the fulls alone, before counting the diffs).
There seem to be plenty of options to minimize cost and runtime here.
- We could cache the full .bak files locally and check whether the file already exists locally before pulling it from the EU S3 bucket. There is only 1 full .bak file every 3 days, yet there are 9 restores in that window, so 8 out of 9 of them could use a cached copy.
- We could additionally alter our backup strategy to take less frequent full backups, so that the full .baks would be downloaded even less often.
- Transfer from S3 to another AWS service within the same region is free, so restoring this DB to an EC2 instance in the EU would eliminate the transfer cost entirely. But the teams using the restored DBs currently need them hosted on-prem, so that's a longer-term thought.
- We could proactively replicate this DB from the EU bucket to a US bucket and download from there, but this would double our storage costs, replication itself incurs inter-region transfer charges, and transfer out of S3 costs the same regardless of region.
- The AWS backbone is faster than the public internet, and S3-to-CloudFront origin fetches are free, so in theory we could privately access these files through CloudFront and download from an edge location at faster speeds. CloudFront is also slightly cheaper than S3 at $0.085/GB. This seems like a lot of engineering work for a small cost savings, though. Our codebase is C#, and we're currently getting files using the AWS SDK for S3 - I haven't looked into how this could work with CloudFront (and I feel like I'm probably missing something here).
My plan is to implement a local cache (the first option above), which is a code-based solution on our side. This seems like a fairly common use case, though, and I'm wondering if I'm missing something obvious.
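For what it's worth, the cache check itself looks small. Here's a minimal sketch of what I have in mind, assuming each full backup has a unique (e.g. timestamped) object key so a filename match is a sufficient cache check, and using `TransferUtility` from the AWS SDK for .NET; the `bucket`, `key`, and `cacheDir` names are placeholders:

```csharp
using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Transfer;

public static class BackupCache
{
    // Map an S3 object key to its path in the local cache directory.
    public static string CachePathFor(string cacheDir, string key) =>
        Path.Combine(cacheDir, Path.GetFileName(key));

    // Return the local path of the full backup, hitting S3 only when
    // no cached copy exists. Assumes full-backup keys are unique
    // (timestamped), so filename existence is a sufficient check.
    public static async Task<string> GetFullBackupAsync(
        IAmazonS3 s3, string bucket, string key, string cacheDir)
    {
        string localPath = CachePathFor(cacheDir, key);
        if (File.Exists(localPath))
            return localPath; // cache hit: skip the cross-region transfer

        await new TransferUtility(s3).DownloadAsync(localPath, bucket, key);
        return localPath;
    }
}
```

If key names were ever reused, comparing the local file's length against the `ContentLength` from `GetObjectMetadataAsync` before skipping the download would guard against stale or truncated cached copies.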