I want to create a process to transfer files from EC2 / EFS to Glacier, but with compression. Say there are directories with timestamps down to the hour. Every hour, I want a process that checks for directories older than 24 hours (configurable), zips up the files in each such directory, and moves the zip file to Glacier (then deletes both the original files and the zip file). It also needs to be highly reliable, with some kind of failure/retry logic. Ideally it would use an existing tool, or at least not require a lot of external coding/logic.

I've found a lot of tools that almost do this:

  • AWS DataSync - moves files reliably - but has no option to add compression
  • AWS Data Pipeline - transfers files with logic - but doesn't seem to support EFS? (Or Glacier, though I suppose I could move the files to S3 with a lifecycle transition to Glacier.)
  • some hybrid solution, like
    • AWS DataSync plus a cron job that builds the zip file - but what about retries? (Roughly what I have in mind is sketched below.)
    • AWS Step Functions workflows running a task on the EC2 box where the EFS is mounted
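
For reference, this is roughly the script I'd otherwise have to write and run from cron myself - an untested sketch, where the bucket name and paths are placeholders and the age check uses directory mtime rather than parsing the timestamp in the directory name:

```python
#!/usr/bin/env python3
"""Untested sketch of the hourly cron job: zip old directories and push them to S3 Glacier."""
import shutil
import time
from pathlib import Path

import boto3
from botocore.exceptions import BotoCoreError, ClientError

ROOT = Path("/mnt/efs/hourly")      # placeholder: EFS mount containing the hourly directories
BUCKET = "my-archive-bucket"        # placeholder bucket name
MAX_AGE_SECONDS = 24 * 3600         # the configurable age threshold
MAX_RETRIES = 5

s3 = boto3.client("s3")


def upload_with_retry(zip_path: Path, key: str) -> None:
    """Upload directly into the Glacier storage class, retrying with exponential backoff."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            s3.upload_file(str(zip_path), BUCKET, key,
                           ExtraArgs={"StorageClass": "GLACIER"})
            return
        except (BotoCoreError, ClientError):
            if attempt == MAX_RETRIES:
                raise
            time.sleep(2 ** attempt)


def main() -> None:
    cutoff = time.time() - MAX_AGE_SECONDS
    for directory in ROOT.iterdir():
        if not directory.is_dir() or directory.stat().st_mtime > cutoff:
            continue
        # shutil.make_archive appends ".zip" to the base name it is given
        zip_path = Path(shutil.make_archive(str(directory), "zip", root_dir=directory))
        upload_with_retry(zip_path, zip_path.name)
        shutil.rmtree(directory)    # delete the originals only after a successful upload
        zip_path.unlink()


if __name__ == "__main__":
    main()
```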

One tool that I'm fairly sure would do it is Apache Airflow, which handles workflows - but that requires a lot of manual coding, and I'm not sure whether AWS Step Functions would end up amounting to the same thing anyway.

It seems like this should be a solved problem - schedule and compress a directory of files, then move the archive to Glacier (with retry logic) - but I haven't found any really clean solutions yet. Is there something I'm missing?

Cyclops
  • @Tim, I don't necessarily disagree with that edit - but is it possible that EFS has any differences from EBS? It isn't actually EBS here. Maybe just saying EC2 filesystem is enough, I dunno. – Cyclops Oct 23 '20 at 20:16
  • Please edit your question so it's precise. You said in your question "(Maybe I shouldn't say EFS - it's really just a filesystem on an EC2 server)", which implied it's EBS. If it's really EFS mapped to an EC2 instance, please say that rather than putting in ambiguous comments. – Tim Oct 24 '20 at 00:04
  • Good point. Yes it's a mounted EFS system, but I was just thinking of it as a filesystem. Will correct it. – Cyclops Oct 24 '20 at 02:05

1 Answer

You have told us your planned methods, but not your problem or aims, which will limit the advice you'll get. Are you archiving and trying to save money? Do you have compliance objectives?

The standalone AWS Glacier service, as opposed to the S3 Glacier storage classes, is really only useful for enterprise compliance needs. S3 with the Glacier / Deep Archive storage classes is sufficient in most cases, and the standalone Glacier service doesn't even offer the cheaper Deep Archive storage class.

Storage is cheap. I suggest you simply create an S3 lifecycle rule that transitions objects to the S3 Glacier Deep Archive storage class once they're 24 hours old. That won't do any compression, but at roughly $1/TB/month it may not be worth the trouble of compressing unless you're dealing with very high data volumes of easily compressible files.
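
For reference, setting such a rule up via boto3 takes only a few lines. This is an untested sketch - the bucket name and prefix are placeholders, and the one-year expiry is optional (see the update below):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefix; transitions objects to Deep Archive one day after creation
# and (optionally) deletes them after a year to cap costs.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "deep-archive-after-one-day",
                "Filter": {"Prefix": "hourly/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 1, "StorageClass": "DEEP_ARCHIVE"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```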

If you really need compression, this would be a fairly simple Lambda function. Your Lambda searches your S3 bucket for objects over 24 hours old, then (for scalability) starts another Lambda for each object to compress it and copy it to another S3 bucket in the Deep Archive storage class.
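
A rough, untested sketch of the first "search and fan out" Lambda - the bucket and worker function names are placeholders:

```python
import json
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

SOURCE_BUCKET = "my-incoming-bucket"      # placeholder bucket holding the uncompressed files
WORKER_FUNCTION = "compress-and-archive"  # hypothetical second Lambda that zips and re-uploads


def handler(event, context):
    """Find objects older than 24 hours and fan out one worker invocation per object."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                # 'Event' = asynchronous invocation, so the files are compressed in parallel
                lambda_client.invoke(
                    FunctionName=WORKER_FUNCTION,
                    InvocationType="Event",
                    Payload=json.dumps({"bucket": SOURCE_BUCKET, "key": obj["Key"]}),
                )
```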

Update

The latest information is that it's about a gigabyte of data per hour. That's roughly 720 GB per month, or about 8.6 TB after a year. At roughly $1/TB/month for the S3 Deep Archive storage class, 8.6 TB costs about $8.60 a month - on the order of $100 a year - which is nothing really if you're having to pay engineers to design, implement, and support a system. This will add up each year, but if you can use a lifecycle rule to delete data after a year it will limit costs.

AWS Glacier is not as flexible as the S3 Glacier / Deep Archive storage classes: you can't use lifecycle rules, and it doesn't have a Deep Archive tier. It's really a product for huge enterprises with strict compliance requirements.

Option One If you can live without compression, your suggestion of DataSync might work - I know nothing about it; there are so many services in AWS. If it can collect a file from EFS and put it into S3 in the Deep Archive storage class, then the job is done, cheaply.

Option Two If your data is highly compressible and reducing the cost from $100 to $30 a year matters, you could have a Lambda fetch the data, do the compression, and write the result to S3 Deep Archive. You wouldn't need the multiple steps you described.
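
A minimal, untested sketch of that single Lambda, assuming the EFS file system is attached to the function through an access point mounted at /mnt/data (the paths, bucket name, and event shape are all placeholders):

```python
import shutil
from pathlib import Path

import boto3

s3 = boto3.client("s3")

EFS_MOUNT = Path("/mnt/data")         # assumed EFS access point mounted on the Lambda
ARCHIVE_BUCKET = "my-archive-bucket"  # placeholder destination bucket


def handler(event, context):
    """Zip one hourly directory (named in the event) and store it straight in Deep Archive."""
    directory = EFS_MOUNT / event["directory"]
    # /tmp is the only writable local path in Lambda and its size is limited,
    # which matters if an hourly directory runs to multiple gigabytes.
    zip_path = Path(shutil.make_archive(f"/tmp/{directory.name}", "zip", root_dir=directory))
    s3.upload_file(str(zip_path), ARCHIVE_BUCKET, zip_path.name,
                   ExtraArgs={"StorageClass": "DEEP_ARCHIVE"})
    shutil.rmtree(directory)          # remove the source data only after the upload succeeds
    zip_path.unlink()
```

Scheduling could come from a simple EventBridge rule, and asynchronous Lambda invocations are retried automatically on failure, so little retry logic would need to be hand-rolled.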

Tim
  • Pretty much all of what you suggested. We'll have incoming hourly data in the gigabyte-range (potentially), and don't want to keep it on EFS longer than necessary, for cost reasons. We will need to keep all data for dispute-purposes, so I need failure/retry on the entire transfer. You've definitely raised some interesting points. `DataSync` would do the transfer to S3, minus the compression. Then possibly Lambdas to compress, and another rule to move the zipfile to Glacier. Seems like the Lambdas would be potentially expensive, as opposed to the EC2 box itself doing the compression, though. – Cyclops Oct 23 '20 at 20:36
  • I've updated my answer based on new information. If this doesn't answer your question please edit it so it's complete and has all the information you have, handing out information bit by bit makes it time consuming to help. – Tim Oct 24 '20 at 00:22
  • Have a look at this article https://aws.amazon.com/premiumsupport/knowledge-center/datasync-transfer-efs-s3/ – Tim Oct 24 '20 at 00:42