5

I am in the process of building a web app that may end up with massive storage requirements, which Amazon S3 should be able to satisfy.

My main concern is the use of API keys on the server: an unauthorised person could exploit the server in some way, obtain the keys, and use them to destroy all the data in the S3 buckets.

  1. What strategies should I put in place to minimise the potential exposure of my API keys?
  2. What would be a robust approach to backing up terabytes of S3 assets on a restrictive budget?
Matt

2 Answers

7

The first thing that comes to mind is that data transfer in and out of S3 is quite spendy. If you're backing up frequently (as you ought to be), costs could get out of hand on transfer fees alone.

That said, to answer your question: backups should be performed from a separate, hardened server whose only task in life is to perform backups. No Apache, remote access only via SSH with key authentication, and so on. If you do these things, and ensure that only a select few people have access to the server, your keys should be quite safe. If you are really paranoid, you can PGP-encrypt the file that contains your keys. The problem with that approach, though, is that it requires you to enter your passphrase each time the backup job runs, and that's probably not something you want to sign up for.
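To make that last option concrete, here is a minimal sketch of PGP-encrypting the key file with GnuPG; the file path, the recipient address, and the idea of storing the keys as shell "export" lines are all assumptions for illustration, not part of the original setup:

    # Encrypt the file holding the AWS keys; the recipient is a placeholder
    # key in your GPG keyring. Remove the plaintext copy afterwards.
    gpg --recipient backup@example.com --encrypt ~/.aws-keys && shred -u ~/.aws-keys

    # At backup time, decrypt into the environment. This is the interactive
    # passphrase prompt mentioned above. The file is assumed to contain lines
    # like: export AWS_ACCESS_KEY_ID=...  and  export AWS_SECRET_ACCESS_KEY=...
    eval "$(gpg --decrypt ~/.aws-keys.gpg)"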

After hearing about your restrictive budget, I can't help but think that it would be better for you to change around your storage strategy. I'm not sure what your server situation is, but could you perhaps host the files locally on the server and then just use S3 for backups? There is a great backup script called duplicity that can perform compressed, encrypted, incremental backups to S3 (among several other backend storage types).
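As a rough illustration (not from the original answer), a duplicity run against S3 looks something like the sketch below; the source path, bucket name, and GPG key ID are placeholders:

    # Duplicity reads the AWS credentials from the environment.
    export AWS_ACCESS_KEY_ID="AKIA................"
    export AWS_SECRET_ACCESS_KEY="........................................"

    # Compressed, encrypted, incremental backup of /var/www to S3, with a
    # fresh full backup forced once a month.
    duplicity --encrypt-key DEADBEEF --full-if-older-than 1M \
        /var/www s3+http://my-backup-bucket/www

    # Restores run in the opposite direction:
    #   duplicity s3+http://my-backup-bucket/www /var/www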

[Edit] If you do end up hosting on S3 and backing up to local disk, it looks like there's an "If-Modified-Since" header in the S3 API that will help with performing incremental backups. For backups like this, you're most likely going to need to homebrew something, though it won't be too difficult. Just use SimpleDB/BerkeleyDB/etc. to store meta information about which files you have backed up, along with a pointer to where they reside on disk. Keeping the meta information in a DB will also make quick work of verifying backups, as well as of creating reports on backup jobs.
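To sketch what that homebrew approach might look like (purely illustrative: s3cmd and a flat-file manifest stand in for the SimpleDB/BerkeleyDB metadata store, and the bucket name and paths are placeholders):

    #!/bin/bash
    # Pull-style incremental backup: copy down any S3 object we have not
    # already recorded in the manifest file.
    BUCKET="s3://my-asset-bucket"
    DEST="/backup/s3-assets"
    MANIFEST="/backup/backed-up.list"

    mkdir -p "$DEST"
    touch "$MANIFEST"

    # `s3cmd ls --recursive` prints date, time, size and the object URL;
    # the URL is the fourth column.
    s3cmd ls --recursive "$BUCKET" | awk '{print $4}' | while read -r object; do
        if ! grep -qxF "$object" "$MANIFEST"; then
            key="${object#$BUCKET/}"
            mkdir -p "$DEST/$(dirname "$key")"
            s3cmd get "$object" "$DEST/$key" && echo "$object" >> "$MANIFEST"
        fi
    done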

EEAA
  • My primary method for storing and serving the assets would be via S3, mainly because I want to offload the scalability issues related to serving static files, and I'd also make use of it as a CDN. My thoughts so far are to push assets to S3 and then somehow signal a hardened remote backup server to pull each file down to its local filesystem (probably using ZFS for snapshotting), with scheduled archiving to an attached backup medium. – Matt Oct 11 '09 at 22:13
  • In that case, when you push the asset to S3, you should get back the object ID. You can use some sort of message queue system to track what needs to be backed up: upon each S3 upload, push a message onto the queue with the object ID, upload time, etc. Then, from the backup server, you can periodically read that queue and back up all the needed files (see the sketch after these comments). As long as you're using AWS, you could even consider using their SQS service for this. – EEAA Oct 11 '09 at 22:29
  • A message queue is a spot on idea. I think I have the solution fully in mind now. Thank you. – Matt Oct 11 '09 at 22:50
  • You're welcome. Good luck! – EEAA Oct 11 '09 at 23:23
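As a follow-up to the comment thread above, here is a rough sketch of that SQS hand-off; the aws CLI is used purely for illustration, and the queue URL, bucket name, object key, and paths are placeholders:

    # On the web server, after each successful upload, enqueue the object key.
    aws sqs send-message \
        --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/s3-backup \
        --message-body "assets/uploads/photo123.jpg"

    # On the backup server, drain the queue periodically and pull each object.
    QUEUE="https://sqs.us-east-1.amazonaws.com/123456789012/s3-backup"
    while true; do
        msg=$(aws sqs receive-message --queue-url "$QUEUE" \
            --max-number-of-messages 1 --output text \
            --query 'Messages[0].[Body,ReceiptHandle]')
        if [ -z "$msg" ] || [ "$msg" = "None" ]; then
            break   # queue is empty
        fi
        key=$(echo "$msg" | cut -f1)
        handle=$(echo "$msg" | cut -f2)
        aws s3 cp "s3://my-asset-bucket/$key" "/backup/s3-assets/$key" &&
            aws sqs delete-message --queue-url "$QUEUE" --receipt-handle "$handle"
    done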
0

I had the same issue. What I did was write a simple bash script to handle it; it works fine within a single region, but it doesn't work across multiple regions. Here is the script: http://geekospace.com/back-up-and-restore-the-database-between-two-aws-ec2-instances/
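The linked script isn't reproduced here, but the general dump-copy-restore idea behind it might look roughly like this sketch; MySQL, s3cmd, and every name below are assumptions, not taken from the script itself:

    # On the source EC2 instance: dump the database and push it to S3.
    mysqldump -u backup_user -p'secret' mydatabase | gzip > /tmp/mydatabase.sql.gz
    s3cmd put /tmp/mydatabase.sql.gz s3://my-db-backups/mydatabase.sql.gz

    # On the destination EC2 instance: pull the dump back down and restore it.
    s3cmd get s3://my-db-backups/mydatabase.sql.gz /tmp/mydatabase.sql.gz
    gunzip -c /tmp/mydatabase.sql.gz | mysql -u backup_user -p'secret' mydatabase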

Jeevan Dongre