1

The main question:

I can't seem to find definitive info about how $path works when used in a where clause in Athena.

select * from <database>.<table> where $path = 'know/path/'

Given a table definition at the top level of a bucket, if there are no partitions specified but the bucket is organized using prefixes does it scan the whole table? Or does it limit the scan to the specified path in a similar way to partitions? Any reference to an official statement on this?

The specific case:

I have information being stored in s3, this information needs to be counted and queried once or twice a day, the prefixes are two different IDs (s3:bucket/IDvalue1/IDvalue2/) and then the file with the relevant data. On a given day any number of new folders might be created (on busy days it could be day tens of thousands) or new files added to existing prefixes. So, maintaining the partition catalog up to date seems a little complicated.

One proposed approach to avoid partitions is using $path when getting data from a know combination of IDs, but I cannot seem to find whether using such approach would actually limit the amount of data scanned per query. I read a comment saying it does not but I cannot find it in the documentation and was wondering if anyone knows how it works and can point to the proper reference.

So far googling and reading the docs has not clarified this.

1 Answers1

7

Update in 2023: Since I wrote the answer below Athena has added support for only reading files that match predicates on $path.

Athena does not have any optimisation for limiting the files scanned when using $path in a query. You can verify this for yourself by running SELECT * FROM some_table and SELECT * FROM some_table WHERE $path = '…' and comparing the bytes scanned (they will be the same, if there was an optimisation they would be different – assuming there is more than one file of course).

See Query by "$path" field and Athena: $path vs. partition


For your use case I suggest using partition projection with the injected type. This way you can limit the prefixes on S3 that Athena will scan, while at the same time not have to explicitly add partitions.

You could use something like the following table properties to set it up (use the actual column names in place of id_col_1 and id_col_2, obviously):

CREATE EXTERNAL TABLE some_table
…
TBLPROPERTIES (
  "projection.id_col_1.type" = "injected",
  "projection.id_col_2.type" = "injected",
  "storage.location.template" = "s3://bucket/${id_col_1}/${id_col_2}/"
)

Note that when querying a table that uses partition projection with the injected type all queries must contain explicit values for the the projected columns.

Theo
  • 131,503
  • 21
  • 160
  • 205
  • 1
    thanks a lot this is definitely an great response, and you are right I tested and there is no optimization whatsoever for path, I'll dive into partition projection for my use case – Carlos Alanis Sep 07 '20 at 14:26
  • I updated the answer with new information. Athena supports `$path` now (and has since engine version 2, I think). – Theo Mar 29 '23 at 09:34
  • Can you share a source for the 2023 update in this answer, that "$path" predicates in queries limit the files read? I am not seeing an clear improvement in performance in Athena v2 engine but I may have something else going on that is preventing the optimization from occurring. I didn't see anything obvious in the 2022 or 2023 changelog for Athena either but I may have missed it. – abc Jun 02 '23 at 01:41
  • 1
    This was just from observation. In v1 it was clear that it didn't result in a reduction in bytes scanned, but in v2 it is (compare the bytes scanned with and without a predicate that selects a single file with `$path` when there are more than one file in total). It's not mentioned in the changelog for Athena. It was introduced somewhere in Presto, and Athena release notes don't include every change in Presto/Trino, only the major ones, and anything Athena adds on top. – Theo Jun 02 '23 at 14:18