Update in 2023: Athena now supports skipping files based on predicates on $path
Currently Athena does not apply any optimisations for $path
, which means that there is no meaningful difference between WHERE "$path" LIKE '%12-01%
and WHERE "date" = '2020-12-01'
(assuming you have a column date
which contains the same date as the file name). Your data probably already has a date or datetime column, and your queries will be more readable using it rather than $path
.
You are definitely on the right track questioning whether or not you need the date part of your current partitioning scheme. There are lots of different considerations when partitioning data sets, and it's not easy to always say what is right without analysing the situation in detail.
I would recommend having some kind of time-based partition key. Otherwise you will have no way to limit the amount of data read by queries, and they will be slower and more expensive as time goes. Partitioning on date is probably too fine grained for your use case, but perhaps year or month would work.
However, if there will only be data for a client for a short time (less than one thousand files in total, the size of one S3 listing page), or queries always read all the data for a client, you don't need a time-based partition key.
To do a deeper analysis on how to partition your data I would need to know more about the types of queries you will be running, how the data is updated, how much data files are expected to contain, and how much difference there will be from client to client.