7

I have an Athena table partitioned by year, month, day with the following definition

CREATE EXTERNAL TABLE `my_table`(
    `price` double) 
PARTITIONED BY ( 
    `year` int, 
    `month` int, 
    `day` int) 
ROW FORMAT SERDE 
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

I need to query it between dates. The options, as I see it, for example are:

SELECT avg(price) 
FROM my_table 
WHERE year = 2018 AND month = 1

Result: Run time: 4.89 seconds, Data scanned: 20.72MB

SELECT avg(price) 
FROM my_table 
WHERE cast(date_parse(concat(cast(year as varchar(4)),'-',
                             cast(month as varchar(2)),'-',
                             cast(day as varchar(2))
                             ), '%Y-%m-%d') as date) 
BETWEEN Date '2018-01-01' AND Date '2018-01-31'

Result: Run time: 8.64 seconds, Data scanned: 20.72MB

So, I guess Athena is smart enough to use the partitioning feature even when casting the concatenated partitioned columns, so why does it take approx. twice the time? What exactly is going on in the backstage?

Many thanks.

gabra
  • 9,484
  • 4
  • 29
  • 45
Shirkan
  • 859
  • 1
  • 9
  • 14
  • Could you send the definition of the table? (SHOW CREATE TABLE table) – jbgorski Oct 23 '18 at 17:38
  • this is more or less the most important part of the definition: `CREATE EXTERNAL TABLE table( price double ) PARTITIONED BY ( year int, month int, day int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'` – Shirkan Oct 25 '18 at 06:46

2 Answers2

0

Athena will use a filterPredicate in that case, you can use EXPLAIN ANALYSE statement to check this:

    EXPLAIN ANALYZE SELECT count(*) FROM "db"."table" 
    where year||month||day >= '20220629';
...
        - ScanFilterProject[table = awsdatacatalog:HiveTableHandle{schemaName=db, tableName=table, analyzePartitionValues=Optional.empty}, grouped = false, 
filterPredicate = ("concat"("concat"("year", "month"), "day") >= CAST('20220629' AS varchar))] => [[]]
                CPU: 2.57s (99.04%), Output: 12424 rows (0B)
                Input avg.: 49.11 rows, Input std.dev.: 54.32%
                LAYOUT: db.table
                month := month:string:-1:PARTITION_KEY
                    :: [[06], [07]]
                year := year:string:-1:PARTITION_KEY
                    :: [[2022]]
                day := day:string:-1:PARTITION_KEY
                    :: [[05], [06], [07], [11], [12], [13], [14], [15], [16], [17], [18], [19], [29], [30]]
                Input: 12424 rows (5.68kB), Filtered: 0.00%
...
Oleksandr Lykhonosov
  • 1,138
  • 12
  • 25
0

I would not rely on column concatenation of partitioned columns. Instead, it would be better to use a single date type partition with partition projection.

Here is an example of how to do this for a column called creation_date

PARTITIONED BY (
`creation_date` string
)
TBLPROPERTIES (
'projection.enabled' = 'true',
'projection.creation_date.type' = 'date',
'projection.creation_date.interval.unit' = 'DAYS',
'projection.creation_date.range' = '2000-01-01,NOW',
'projection.creation_date.format' = 'yyy-MM-dd'
)

This would allow for simple, well planned queries (e.g. date between 2021-12-01 and 2022-01-20), and using partition projection should prevent any issue with # of partition limits.

Also, here is my additional experience for how Athena behaved with concatenated and cast partition columns.

When I used partition projection with year/month/ partitions, Athena was not smart enough to use the partitions when casting and concatenating the year and month.

Athena scanned the entirety of the year partition files for each year, and took significantly (10x) longer than if I pulled the two months directly with an OR statement.

The below query covers data from December 2020 to January 2021, and it (unfortunately) scanned all of the data in the 2020 and 2021 partitions:

EXPLAIN ANALYZE SELECT * FROM "telemetry_test"."sdmym-day-batch-parquet-labelled" WHERE sid='19009' AND mid='1000' AND di='e5781489-fa7f-4bc5-8858-0c3cc6ba80c7' 
    AND (cast(year as varchar)||cast(month as varchar)<='202101' OR cast(year as varchar)||cast(month as varchar)>='202012');

Which returned

Query Plan
Fragment 1 
    CPU: 12.28s, Input: 4210560 rows (132.80MB), Data Scanned: 63.15MB; per task: std.dev.: 2880.00, Output: 4210560 rows (145.35MB)
    Output layout: [ts, value, sid, di, mid, year, month]
    - ScanFilter[table = awsdatacatalog:HiveTableHandle{schemaName=telemetry_test, tableName=sdmym-day-batch-parquet-labelled, analyzePartitionValues=Optional.empty}, grouped = false, filterPredicate = (("concat"(CAST("year" AS varchar), CAST("month" AS varchar)) <= CAST('202101' AS varchar)) OR ("concat"(CAST("year" AS varchar), CAST("month" AS varchar)) >= CAST('202012' AS varchar)))] => [[ts, value, sid, di, mid, year, month]]
            CPU: 12.27s (100.00%), Output: 4210560 rows (145.35MB)
            Input avg.: 5760.00 rows, Input std.dev.: 0.00%
            LAYOUT: telemetry_test.sdmym-day-batch-parquet-labelled
            month := month:int:-1:PARTITION_KEY
                :: [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]]
            year := year:int:-1:PARTITION_KEY
                :: [[2020], [2021], [2022]]
            di := di:string:-1:PARTITION_KEY
                :: [[e5781489-fa7f-4bc5-8858-0c3cc6ba80c7]]
            mid := mid:string:-1:PARTITION_KEY
                :: [[1000]]
            value := value:double:1:REGULAR
            sid := sid:string:-1:PARTITION_KEY
                :: [[19009]]
            ts := ts:string:0:REGULAR
            Input: 4210560 rows (132.80MB), Filtered: 0.00%
zthatch56
  • 21
  • 8