I would not rely on column concatenation of partitioned columns. Instead, it would be better to use a single date type partition with partition projection.
Here is an example of how to do this for a column called creation_date
PARTITIONED BY (
`creation_date` string
)
TBLPROPERTIES (
'projection.enabled' = 'true',
'projection.creation_date.type' = 'date',
'projection.creation_date.interval.unit' = 'DAYS',
'projection.creation_date.range' = '2000-01-01,NOW',
'projection.creation_date.format' = 'yyy-MM-dd'
)
This would allow for simple, well planned queries (e.g. date between 2021-12-01 and 2022-01-20), and using partition projection should prevent any issue with # of partition limits.
Also, here is my additional experience for how Athena behaved with concatenated and cast partition columns.
When I used partition projection with year/month/ partitions, Athena was not smart enough to use the partitions when casting and concatenating the year and month.
Athena scanned the entirety of the year partition files for each year, and took significantly (10x) longer than if I pulled the two months directly with an OR statement.
The below query covers data from December 2020 to January 2021, and it (unfortunately) scanned all of the data in the 2020 and 2021 partitions:
EXPLAIN ANALYZE SELECT * FROM "telemetry_test"."sdmym-day-batch-parquet-labelled" WHERE sid='19009' AND mid='1000' AND di='e5781489-fa7f-4bc5-8858-0c3cc6ba80c7'
AND (cast(year as varchar)||cast(month as varchar)<='202101' OR cast(year as varchar)||cast(month as varchar)>='202012');
Which returned
Query Plan
Fragment 1
CPU: 12.28s, Input: 4210560 rows (132.80MB), Data Scanned: 63.15MB; per task: std.dev.: 2880.00, Output: 4210560 rows (145.35MB)
Output layout: [ts, value, sid, di, mid, year, month]
- ScanFilter[table = awsdatacatalog:HiveTableHandle{schemaName=telemetry_test, tableName=sdmym-day-batch-parquet-labelled, analyzePartitionValues=Optional.empty}, grouped = false, filterPredicate = (("concat"(CAST("year" AS varchar), CAST("month" AS varchar)) <= CAST('202101' AS varchar)) OR ("concat"(CAST("year" AS varchar), CAST("month" AS varchar)) >= CAST('202012' AS varchar)))] => [[ts, value, sid, di, mid, year, month]]
CPU: 12.27s (100.00%), Output: 4210560 rows (145.35MB)
Input avg.: 5760.00 rows, Input std.dev.: 0.00%
LAYOUT: telemetry_test.sdmym-day-batch-parquet-labelled
month := month:int:-1:PARTITION_KEY
:: [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]]
year := year:int:-1:PARTITION_KEY
:: [[2020], [2021], [2022]]
di := di:string:-1:PARTITION_KEY
:: [[e5781489-fa7f-4bc5-8858-0c3cc6ba80c7]]
mid := mid:string:-1:PARTITION_KEY
:: [[1000]]
value := value:double:1:REGULAR
sid := sid:string:-1:PARTITION_KEY
:: [[19009]]
ts := ts:string:0:REGULAR
Input: 4210560 rows (132.80MB), Filtered: 0.00%