1

I'm storing daily reports per client for query with Athena.
At first I thought I'd use a client=c_1/month=12/day=01/ or client=c2/date=2020-12-01/ folder structure, and run MSCK REPAIR TABLE daily to make new day partition available for query.

Then I realized there's the $path special column, so if I store files as 2020-12-01.csv I could run a query with WHERE $path LIKE '%12-01% thus saving a partition and the need to detect/add it daily.

I can see this having an impact on performance if there was a lot of daily data, But in my case the day partition will include one file at most, so a partition is mostly to have a field to query, not reduce query dataset.

Any other downside?

Nihil
  • 146
  • 1
  • 5

2 Answers2

1

When using $path column, all table (partition) location needs to be fully listed. if you have large number of objects in S3, this listing can become a bottleneck. Partitions avoid this problem.

Of course, having large number of partitions is also a problem. I don't know what the cardinality of client column, so hard to tell how many partitions to expect with this approach.

Piotr Findeisen
  • 19,480
  • 2
  • 52
  • 82
  • If I remember correctly, in Athena `$path` is not optimised in any way – not only will all locations be listed, all files will be read. – Theo Jun 15 '20 at 15:03
  • Thanks @Theo for pointing this out. In this particular case, it probably doesn't impact the end state: all partitions need to be enumerated. – Piotr Findeisen Jun 16 '20 at 16:11
1

Update in 2023: Athena now supports skipping files based on predicates on $path

Currently Athena does not apply any optimisations for $path, which means that there is no meaningful difference between WHERE "$path" LIKE '%12-01% and WHERE "date" = '2020-12-01' (assuming you have a column date which contains the same date as the file name). Your data probably already has a date or datetime column, and your queries will be more readable using it rather than $path.

You are definitely on the right track questioning whether or not you need the date part of your current partitioning scheme. There are lots of different considerations when partitioning data sets, and it's not easy to always say what is right without analysing the situation in detail.

I would recommend having some kind of time-based partition key. Otherwise you will have no way to limit the amount of data read by queries, and they will be slower and more expensive as time goes. Partitioning on date is probably too fine grained for your use case, but perhaps year or month would work.

However, if there will only be data for a client for a short time (less than one thousand files in total, the size of one S3 listing page), or queries always read all the data for a client, you don't need a time-based partition key.

To do a deeper analysis on how to partition your data I would need to know more about the types of queries you will be running, how the data is updated, how much data files are expected to contain, and how much difference there will be from client to client.

Theo
  • 131,503
  • 21
  • 160
  • 205
  • Thanks @Theo, the problem is my data does not have a 'date' or 'day' column. I'm trying to create one either by using a `day=` partition, or by saving to s3 with day in the file name and filtering by `$path`. But I think you're right - without a partition there's no way to limit the amount of data read. Even though I have just one file per 'day' partition - it will grow over time, incurring higher costs and slowing queries. Maybe the best solution is to use a `month=` partition, and query specific `day` with `$path`, that way I get the best of both in a way.. – Nihil Jun 27 '20 at 08:47
  • Monthly partitions is often a good trade-off. Worst case you read 30x the amount of data you need, but if queries are often for longer periods like weeks, the difference isn't too much to worry about in many cases. I would recommend not partitioning like `year=2020/month=06`, though, instead do `month=2020-06` or `month=2020-06-01`. It's going to be much easier to work with (especially now with the new Partition Projection feature). – Theo Jun 29 '20 at 14:17