
Following AWS-recommended best practices, we have organization-wide CloudTrail and VPC Flow Logs configured to deliver to a centralized log-archive account. Since CloudTrail and VPC Flow Logs are enabled organization-wide across multiple regions, we get a high number of new log files written to S3 every day. Most of these files are quite small (a few KB each).

The high number of small log files is fine while they're in the STANDARD storage class, since you pay only for total data size with no minimum object size overhead. However, we've found it challenging to deep-archive these files after 6 or 12 months, because every other storage class penalizes small objects in some way: STANDARD-IA has a 128 KB minimum billable object size, GLACIER has no minimum size but adds 40 KB of index and metadata overhead per object, and so on. For a few-KB log file, that overhead can exceed the size of the file itself several times over.

What are the best practices for archiving a large number of small S3 objects? I could use a Lambda to download multiple files, re-bundle them into a larger file, and re-store it, but that would be pretty expensive in terms of compute time and GET/PUT requests. As far as I can tell, S3 Batch Operations has no support for this. Any suggestions?
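
For concreteness, the naive Lambda approach I have in mind is something like the sketch below (the bucket, prefix, and key names are made up for illustration; a real version would also need error handling and cleanup of the source objects):

import boto3

s3 = boto3.client("s3")

def bundle_day(bucket, src_prefix, dest_key):
    # Concatenate one day's .gz log objects into a single object.
    # Gzip members can be concatenated byte-for-byte and the result is
    # still a valid gzip stream, so no decompress/recompress is needed.
    parts = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            # One GET per source object -- this is the expensive part.
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            parts.append(body)
    # A single PUT for the whole day's bundle. This assumes the day's
    # logs fit in Lambda memory; otherwise you'd stream via a multipart
    # upload instead.
    s3.put_object(Bucket=bucket, Key=dest_key, Body=b"".join(parts))

# Hypothetical names for illustration only.
bundle_day(
    bucket="my-log-archive",
    src_prefix="AWSLogs/123456789012/CloudTrail/us-east-1/2021/07/31/",
    dest_key="bundled/cloudtrail/2021-07-31.gz",
)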

Jordan
This is a really great question and something I have run into in the past. Not only is this an issue for archival, but if you want to query this data with Athena, querying tiny files is much less performant than querying larger files. One additional question to ask yourself: what are your data retention policies? Do you really need to keep raw VPC Flow Logs around for > 12 months? – JD D Jul 31 '21 at 23:06

1 Answer


Consider using a tool like s3-utils concat. It's not an AWS-supported tool but an open-source utility built to perform exactly the type of action you're describing.

You'll probably want the pattern-matching syntax, which allows you to create a single file for each day's logs.

$ s3-utils concat my.bucket.name 'date-hierarchy/(\d{4})/(\d{2})/(\d{2})/*.gz' 'flat-hierarchy/$1-$2-$3.gz'

This could be run as a daily job so that each day's logs are condensed into a single file. I'd definitely recommend running it from a resource on the Amazon network (i.e. inside your VPC with an S3 gateway endpoint attached) to improve file transfer performance and avoid data transfer out fees.
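
Once the daily bundles exist, your original archiving goal becomes a standard lifecycle rule scoped to the flattened prefix. A minimal boto3 sketch, assuming the bucket and output prefix from the example command above and a 12-month transition:

import boto3

s3 = boto3.client("s3")

# Bucket and prefix are assumptions matching the example command above.
s3.put_bucket_lifecycle_configuration(
    Bucket="my.bucket.name",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "deep-archive-daily-bundles",
                "Status": "Enabled",
                # Target only the concatenated daily files, not the
                # raw per-delivery objects.
                "Filter": {"Prefix": "flat-hierarchy/"},
                "Transitions": [
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            },
        ],
    },
)

Because each transitioned object is now a day's worth of logs rather than a few KB, the per-object minimums and metadata overhead discussed in the question become negligible.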

JD D