5

I have a very flat S3 bucket. Here is how the S3 object keys in that bucket look like:

  • s3-access-logs/2017-11-03-00-22-36-05A50CD782AE8AE0
  • s3-access-logs/2017-11-03-00-24-21-F14ED1FF6C315431

As you can see I have only one S3 folder "s3-access-logs" with ALL the objects under that folder. In fact this S3 bucket contains the S3 access log for a different S3 bucket.

I want to run some analysis on these S3 access log (using Athena). Athena allows me to either:

  1. Create an Athena Table using the S3 bucket as location, or
  2. Create an Athena Table with partitioning turned on and I can add a partition using an s3 prefix.

I only care about the access log for a certain date so I want to avoid scanning the entire S3 bucket (which I tried and the query never completed after more than 15 min). I would like Athena to only scan files for that date. I noticied that Athena is OK with using "s3-access-logs" as the S3 location / prefix, but Athena does not seem to support using "s3-access-logs/2017-11-03" as the S3 location / prefix.

Is it true that Athena only support "S3 Folder" as prefix or location (i.e. the prefix string must end with a slash), but not any random string in the s3 object key prefix? If so is there any workaround for this issue?

Thanks!

Erben Mo
  • 3,528
  • 3
  • 19
  • 32

2 Answers2

0

The underlying filesystem abstractions used by distributed query engines like Athena and Spark require table and partition locations to look like directories in a traditional filesystem. Since Athena is a fully managed service there is also no way to configure it to use a different separator than /.

When Athena runs a query it will list all the files in the table or partition prefix(es). Therefore workarounds like filtering with $path don't work in this scenario – all the files have to be listed before the filtering happens, and it's the listing that takes time.

If you can move the files into one prefix ending with / per day that will work best with Athena. One way I've done this in the past is to configure a Lambda function to run when a new file is written and for the function to copy the file to a new key by replacing the - after the date with a / (and then deleting the original file).

If you for some reason can't move the files, there is also SymlinkTextInputFormat. This requires creating a separate directory structure with manifest files that list all the files for each partition. For example, logs/2017-11-03/symlink.txt with the full URI of each file for that date, and so on for each date.

Besides the added complexity of maintaining the manifests, it also adds query latency since Athena needs to make multiple rounds of file listings. Compared to listing potentially hundreds of thousands of files in a flat namespace like S3 access logs it's a lot less latency, though.

You can read more about how to use SymlinkTextInputFormat here: https://athena.guide/articles/stitching-tables-with-symlinktextinputformat/

Theo
  • 131,503
  • 21
  • 160
  • 205
-2

Looks like there is datetimestamp in your log file name, you can use "$path" to scan only selected file in Athena.

Select * from yourTableName where "$path" like '2017-11-03%'

Aftab Ansari
  • 926
  • 9
  • 17
  • 1
    This sounds like it would work, but [unfortunately, Athena does not do any optimization for limiting files scanned when conditioning on `$path`](https://stackoverflow.com/questions/63746445/does-path-limit-the-amount-of-data-scanned-by-athena). – Semafoor May 12 '22 at 09:51