I have setup Firehose to deliver GZIP JSON files to S3 with a yyyy/MM/dd/HH/
path template:
s3://bucket-name/events/2022/11/08/16/file-2022-11-08-16-47-41-xxx.gz
If I create an external table in Athena with NO projection:
CREATE EXTERNAL TABLE `noproj` (
`ts` bigint,
`url` string,
`useragent` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket-name/events/'
TBLPROPERTIES (
'classification'='json',
'compressionType'='gzip',
'typeOfData'='file'
)
Everything works fine and I get results with SELECT * FROM noproj
But if I create a partition projection:
CREATE EXTERNAL TABLE `proj` (
`ts` bigint,
`url` string,
`useragent` string
)
PARTITIONED BY (
`date_created` string,
`hour` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket-name/events/'
TBLPROPERTIES (
'classification'='json',
'compressionType'='gzip',
'typeOfData'='file',
'projection.enabled' = 'true',
'projection.date_created.type' = 'date',
'projection.date_created.format' = 'yyyy/MM/dd',
'projection.date_created.interval' = '1',
'projection.date_created.interval.unit' = 'DAYS',
'projection.date_created.range' = '2022/01/01, NOW',
'projection.hour.type' = 'integer',
'projection.hour.range' = '0,23',
'projection.hour.digits' = '2',
'storage.location.template'='s3://bucket-name/events/${date_created}/${hour}/'
)
I get NO results at all I tried the following queries:
SELECT * FROM proj
SELECT * FROM proj WHERE date_created = '2022/11/08'
SELECT * FROM proj WHERE date_created = '2022/11/08' AND hour = '16'
SELECT * FROM proj WHERE date_created >= date('2022-01-01')
Note that the last query throws an error:
SYNTAX_ERROR: line 1:41: '>=' cannot be applied to varchar, date
Probably related to Athena Partition Projection for Date column vs. String
I tried the following but with no success:
Edit: I tried with a simpler projection and it also doesn't work:
CREATE EXTERNAL TABLE `year` (
`ts` bigint,
`url` string,
`useragent` string
)
PARTITIONED BY (
`year` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket-name/events/'
TBLPROPERTIES (
'classification'='json',
'compressionType'='gzip',
'projection.year.type' = 'integer',
'projection.year.range' = '2022,2023',
'projection.enabled' = 'true',
'storage.location.template'='s3://bucket-name/events/${year}/'
)
Edit2: I also tried with no template location and
s3://bucket-name/events/year=2022/month=11/day=08/hour=16/file-2022-11-08-16-47-41-xxx.gz
Edit3: I also tried with parquet format and I have the exact same behaviour. Results with NO partition. No results with a projected partition