
Recently, I've experienced an issue with AWS Athena when there is quite a high number of partitions.

The old version had a database and tables with only one partition level, say id=x. Take one table, for example, where we store payment parameters per id (product); there aren't many IDs, assume around 1000-5000. When querying that table with the id in the where clause, like "... where id = 10", the queries actually returned pretty fast. Assume we update the data twice a day.

Lately, we've been thinking of adding another partition level for the day, like "../id=x/dt=yyyy-mm-dd/..". This means the number of partitions grows by the number of IDs each day: after a month, with 3000 IDs, we'd get approximately 3000x30 = 90,000 new partitions. Thus, a rapid growth in the number of partitions.
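
For concreteness, here is a minimal DDL sketch of the two-level layout; the table name, columns, and location are hypothetical, the `PARTITIONED BY` clause is the point:

```sql
-- Hypothetical DDL for the two-level partitioning described above.
CREATE EXTERNAL TABLE db.payment_params (
    param_name  string,
    param_value string
)
PARTITIONED BY (id string, dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/payment-params/';
-- Objects then land under s3://my-bucket/payment-params/id=x/dt=yyyy-mm-dd/
```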

On, say, 3-months-old data (~270k partitions), we'd like a query like the following to return in at most 20 seconds or so:

```sql
select count(*) from db.table where id = x and dt = 'yyyy-mm-dd'
```

Instead, it takes about a minute.

The Real Case

It turns out Athena first fetches all the partitions (metadata) and S3 paths (regardless of the where clause) and only then filters those S3 paths down to the ones matching the where condition. The first part (fetching all S3 paths by partition) takes time proportional to the number of partitions.

The more partitions you have, the slower the query executes.

Intuitively, I expected that Athena would fetch only the S3 paths stated in the where clause; that would be one piece of the magic of partitioning. Instead, it seems to fetch all paths.

  • Does anybody know a workaround, or are we using Athena in the wrong way?
  • Should Athena be used only with a small number of partitions?

Edit

To clarify the statement above, here is an excerpt from the support mail.

from Support

... You mentioned that your new system has 360000 partitions, which is a huge number. So when you are doing select * from <partitioned table>, Athena first downloads all partition metadata and searches the S3 paths mapped to those partitions. This process of fetching data for each partition leads to longer query execution time. ...

Update

An issue has been opened on the AWS forums; the linked issue is here.

Thanks.

null
  • Did you already consider bucketing? – Piotr Findeisen Dec 26 '19 at 12:30
  • @PiotrFindeisen Do you mean bucketing by day instead of partitioning by day? I haven't tried that, but would it speed up the where clause? If you meant getting an optimal number of files, you can assume that we have an optimal number of files in each partition. – null Dec 26 '19 at 16:54
  • I don't know your query patterns (this is the key part, really). Intuitively I'd try partitioning by `dt` and bucketing by `id` first (see the CTAS sketch after this comment thread). However, I don't know why you partitioned by `id` and what `id` actually is. Also, there is no such thing as an optimal number of files. If you use ORC or Parquet, you just care about the files being at least, say, 32-64 MB, but individual files can be quite big. – Piotr Findeisen Dec 26 '19 at 21:17
  • BTW, as you see, this is not a simple question that fits SO well, and there is no single answer. I recommend you consult Presto experts on the [Presto community slack](https://prestosql.io/slack.html). – Piotr Findeisen Dec 26 '19 at 21:19
  • "It turns out Athena first fetches all the partitions and s3 paths", how did you figure this out? AWS documentation specifically states that "If you query a partitioned table and specify the partition in the WHERE clause, Athena scans the data only from that partition". You can verify this in the query history tab, where it shows the amount of actual data scanned for the select query. – Harsh Bafna Dec 27 '19 at 06:44
  • Also, a lot will depend on the actual data: daily data size, format of the data files stored on S3, and what type of SQL queries you are executing. It would be nice if you could post some numbers on the time taken for similar queries before and after partitioning by the "dt" column. – Harsh Bafna Dec 27 '19 at 06:47
  • @PiotrFindeisen Thanks for the comment. The reason we partition by `id` first is our query patterns, as you said: 95% of the time we work with `id`s, but we thought it would be good to add days in order to get faster query results when stating both `id` and `dt`. We're storing the files as Parquet. – null Dec 27 '19 at 07:00
  • @HarshBafna We first suspected this intuitively, since our queries had started getting slower. Then we contacted support, and they stated the same thing we thought. As you mentioned, "If you query a partitioned table and specify the partition in the WHERE clause, Athena **scans** the data only from that partition". The key word here is "scans": Athena first fetches all the S3 paths in the partition declaration to be able to scan only the specified paths, and for that it queries its own internal schema. This part is what slows the query down, not the scanning. – null Dec 27 '19 at 07:05
  • @null: This might be a helpful article for your use case: https://aws.amazon.com/premiumsupport/knowledge-center/athena-query-time-out-many-partitions/ – Harsh Bafna Dec 27 '19 at 08:24
  • @HarshBafna Thanks, the article was related; it was a good shot. In our case, `id` was an integer and the `dt` partition was a string. I changed `id` to string, but no change unfortunately :( – null Dec 27 '19 at 10:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/204929/discussion-between-null-and-harsh-bafna). – null Dec 27 '19 at 10:14
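
To make Piotr's bucketing suggestion concrete, here is a minimal CTAS sketch that partitions by `dt` and buckets by `id` in Athena. The table names, columns, output location, and bucket count are all hypothetical; note also that a single Athena CTAS statement can write at most 100 partitions.

```sql
-- Hypothetical CTAS: partition by dt, bucket by id (names are made up).
CREATE TABLE db.payments_bucketed
WITH (
    format            = 'PARQUET',
    external_location = 's3://my-bucket/payments-bucketed/',
    partitioned_by    = ARRAY['dt'],
    bucketed_by       = ARRAY['id'],
    bucket_count      = 16
) AS
SELECT param_name, param_value, id, dt  -- partition column (dt) must come last
FROM db.payment_params;
```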

1 Answer


This is impossible to answer properly without knowing the amount of data, the file formats, and how many files we're talking about.

TL;DR: I suspect you have partitions with thousands of files and that the bottleneck is listing and reading them all.

For any data set that grows over time you should have temporal partitioning, on date or even time, depending on query patterns. Whether you should also partition on other properties depends on a lot of factors, and in the end it often turns out that not partitioning is better. Not always, but often.

Using reasonably sized (~100 MB) Parquet files can in many cases be more effective than partitioning. The reason is that partitioning increases the number of prefixes that have to be listed on S3, and the number of files that have to be read. A single 100 MB Parquet file can be more efficient than ten 10 MB files in many cases.

When Athena executes a query it will first load partitions from Glue. Glue supports limited filtering on partitions, which helps a bit in pruning the partition list – so to the best of my knowledge it's not true that Athena reads all partition metadata.
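
As a rough illustration (table and values are placeholders): a predicate that compares the partition columns directly can, as far as I know, be translated into Glue's partition filter, while an expression computed over a partition column may not be, in which case more partition metadata has to be fetched and filtered by the engine.

```sql
-- Direct comparisons on partition keys can be pushed down to Glue,
-- so only matching partitions are fetched.
SELECT count(*) FROM db.table WHERE id = '10' AND dt = '2019-12-01';

-- A function applied to a partition key may not be pushed down; Athena
-- may fetch metadata for far more partitions and filter them itself.
SELECT count(*) FROM db.table WHERE substr(dt, 1, 7) = '2019-12';
```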

When it has the partitions it will issue LIST operations against the partition locations to gather the files involved in the query – in other words, Athena won't list every partition location, just the ones in partitions selected for the query. This may still be a large number, and these list operations are definitely a bottleneck. It becomes especially bad if there are more than 1000 files in a partition, because that's the page size of S3's list operations, so multiple requests will have to be made sequentially.

With all files listed Athena will generate a list of splits, which may or may not equal the list of files – some file formats are splittable, and if files are big enough they are split and processed in parallel.

Only after all of that work is done does the actual query processing start. Depending on the total number of splits and the amount of available capacity in the Athena cluster, your query will be allocated resources and start executing.

If your data was in Parquet format, and there were only one or a few files per partition, the count query in your question should run in a second or less. Parquet has enough metadata in the files that a count query doesn't have to read the data, just the file footer. It's hard to get any query to run in less than a second due to the multiple steps involved, but a query hitting a single partition should run quickly.

Since it takes two minutes I suspect you have hundreds of files per partition, if not thousands, and your bottleneck is that it takes too much time to run all the list and get operations in S3.
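
One way to check this from Athena itself is the `"$path"` pseudo-column, which exposes the S3 object each row was read from (partition values below are placeholders):

```sql
-- Count the distinct S3 objects backing a single partition; a result in
-- the hundreds or thousands supports the listing-bottleneck theory.
SELECT count(DISTINCT "$path") AS files_in_partition
FROM db.table
WHERE id = 'x' AND dt = 'yyyy-mm-dd';
```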

Theo
  • Thanks for the detailed answer. Indeed, as you stated, we had aimed for an optimal file size (storing Parquet), but the bottleneck was S3 listing, probably due to the 1000-object page size of S3 list operations. Our problem was applying two partition columns in Glue, which is not a best practice, so we changed the table structure to apply one partition column per table; that pretty much solved the issue. – null Mar 29 '20 at 10:26