
I'm setting up a processing pipeline for CloudFront log files. Reading the documentation, my understanding is that CF will create one log file per hour per distribution, but that's not what I'm seeing in my bucket. I get multiple files per distro (per hour):

E39O6KS6J8MIZW.2015-10-09-23.083b2c12.gz 
E39O6KS6J8MIZW.2015-10-09-23.1a96bb61.gz
E39O6KS6J8MIZW.2015-10-09-23.4cd34dd8.gz 
E39O6KS6J8MIZW.2015-10-09-23.50c7b5b1.gz

What am I missing? Basically, what I'm trying to understand is what drives the creation of new log files.

Dmitry B.

1 Answer


CloudFront, as you likely know, is a globally-distributed system where provisioning is centralized, but the 50+ edge locations operate independently once provisioning is pushed out to them.

The logs are, presumably, collected either locally at each edge, or regionally, and then periodically collected and assembled into consolidated logs and published to your log bucket.

The timestamp embedded in the log file name represents, approximately, the hour during which the contained events occurred. As such, the log for a given hour will often not arrive during that hour, or even in the hour immediately following.
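To make the naming concrete, here is a minimal sketch that splits a log file name into its distribution ID, hour, and unique suffix, then groups names by hour. It assumes the `<distribution-id>.<YYYY-MM-DD-HH>.<unique-id>.gz` layout seen in the question and treats the hour as UTC; `parse_log_name` and `group_by_hour` are hypothetical helper names, not part of any AWS library.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Assumed layout: <distribution-id>.<YYYY-MM-DD-HH>.<unique-id>.gz
LOG_NAME = re.compile(
    r"^(?P<dist>[A-Z0-9]+)\.(?P<hour>\d{4}-\d{2}-\d{2}-\d{2})\.(?P<uid>[0-9A-Za-z]+)\.gz$"
)

def parse_log_name(name):
    """Return (distribution_id, hour_utc, unique_id), or None if the name does not match."""
    base = name.rsplit("/", 1)[-1]  # drop any key prefix in front of the file name
    m = LOG_NAME.match(base)
    if m is None:
        return None
    hour = datetime.strptime(m.group("hour"), "%Y-%m-%d-%H").replace(tzinfo=timezone.utc)
    return m.group("dist"), hour, m.group("uid")

def group_by_hour(names):
    """Map (distribution_id, hour_utc) to every file name seen for that hour."""
    groups = defaultdict(list)
    for name in names:
        parsed = parse_log_name(name)
        if parsed is not None:
            dist, hour, _uid = parsed
            groups[(dist, hour)].append(name)
    return dict(groups)
```

Grouping the four files from the question this way yields a single `(distribution, hour)` entry holding all four names, which is the shape a pipeline should expect: a list of files per hour, not one file.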

If anything prevents the logs from certain edges from being collected in a timely fashion (as would be expected in a global, distributed platform), they will normally arrive within a few hours, in a back-dated log file that represents the approximate time the logs were originally recorded.
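If you want to see how late a given file was, one rough approach is to compare the hour in its name with the time the object landed in S3. The sketch below assumes you already have the object's last-modified time as a timezone-aware `datetime` (for example from a bucket listing) and that the hour in the name is UTC; `delivery_lag` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

def delivery_lag(key, last_modified):
    """Return how long after the start of the logged hour the object was written.

    key: object key ending in <distribution-id>.<YYYY-MM-DD-HH>.<unique-id>.gz
    last_modified: timezone-aware datetime of the S3 object (assumed input)
    """
    base = key.rsplit("/", 1)[-1]    # drop any key prefix
    hour_text = base.split(".")[1]   # the YYYY-MM-DD-HH component
    hour_start = datetime.strptime(hour_text, "%Y-%m-%d-%H").replace(tzinfo=timezone.utc)
    return last_modified - hour_start
```

A lag of more than an hour or two would suggest a back-dated file of the kind described above.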

Timing of Log File Delivery

CloudFront delivers access logs for a distribution up to several times an hour. In general, a log file contains information about the requests that CloudFront received during a given time period. CloudFront usually delivers the log file for that time period to your Amazon S3 bucket within an hour of the events that appear in the log. Note, however, that some or all log file entries for a time period can sometimes be delayed by up to 24 hours. When log entries are delayed, CloudFront saves them in a log file for which the file name includes the date and time of the period in which the requests occurred, not the date and time when the file was delivered.

http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#access-logs-timing

So, essentially, CloudFront will create at least one log file for each hour that your distribution has any traffic, but a log can arrive at essentially any time... so you can't effectively poll the bucket looking for certain patterns based on the current time, the time of the previous hour, etc.

One way to handle these as expeditiously as possible (without polling the bucket) is S3 event notifications.
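As a rough sketch of the receiving side, the function below pulls the bucket and key out of an S3 event notification payload, for example inside a Lambda function subscribed to the log bucket. It assumes the standard notification structure (`Records[].s3.bucket.name` and `Records[].s3.object.key`, with the key URL-encoded); `log_objects_from_event` and `handler` are hypothetical names, and actually downloading and parsing each file is left out.

```python
from urllib.parse import unquote_plus

def log_objects_from_event(event):
    """Yield (bucket, key) for each newly created .gz object in an S3 event payload."""
    for record in event.get("Records", []):
        # Only react to object-creation events, e.g. "ObjectCreated:Put".
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        if key.endswith(".gz"):
            yield bucket, key

def handler(event, context):
    """Hypothetical Lambda entry point: return the log objects that need processing."""
    return list(log_objects_from_event(event))
```

Because the notification fires when a file is written, it does not matter which hour the file name claims to cover.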

In any event, you need to be prepared to handle a log with any timestamp, whenever it's written. Don't assume that a second file for the same hour is a duplicate, and don't disregard a log because its timestamp seems older than expected.
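One way to follow that advice, sketched here under the assumption that each log object should be processed exactly once, is to track work by the full object key rather than by hour. The in-memory set stands in for whatever durable store you would really use, and `select_new_logs` is a hypothetical helper.

```python
def select_new_logs(keys, processed):
    """Return keys that have not been processed yet, and record them as processed.

    keys: iterable of S3 object keys, in the order they were observed
    processed: set of keys already handled (stand-in for a durable store)
    """
    new = []
    for key in keys:
        # Identity is the full key: several files for one hour are all kept,
        # and a file for an old hour is kept no matter when it shows up.
        if key not in processed:
            processed.add(key)
            new.append(key)
    return new
```

Note that there is deliberately no time-window check: a back-dated file is accepted like any other, and only a repeat of the exact same key is skipped.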

Michael - sqlbot
  • I'm looking at log files that were collected several months ago and seeing a large variance in the number of files per day (min: 3000 files; max: 21000 files). And it's not a continuous trend either. It looks cyclical, with a spike every 5 weeks. So this variance would be due to some "environmental" conditions that affect how efficiently log files can be collected from edges? – Dmitry B. Nov 11 '15 at 20:53
  • Anecdotally, I'd say if it isn't related to your traffic patterns, "environmental" conditions would be the most likely explanation, particularly when the `Last-Modified` value in S3 is further separated in time from the timestamp embedded in the log file name. There are probably several internal, undocumented factors affecting log delivery and grouping, but notable variation in number of logs per hour is normal behavior, by my observations. – Michael - sqlbot Nov 11 '15 at 22:59
  • The total log volume per day (in bytes) is on a steadily increasing trend, so I assume that traffic patterns aren't a factor (I haven't had a chance to look at the actual request counts). I'm assuming this is normal CF behavior, but I need to understand the cause, as I'm sure I'll be asked for an explanation once management sees the charts. Good tip on `Last-Modified`, I'll take a look at that. Thanks, Michael. – Dmitry B. Nov 11 '15 at 23:48