2

I have setup Firehose to deliver GZIP JSON files to S3 with a yyyy/MM/dd/HH/ path template:

s3://bucket-name/events/2022/11/08/16/file-2022-11-08-16-47-41-xxx.gz

If I create an external table in Athena with NO projection:

CREATE EXTERNAL TABLE `noproj` (
  `ts` bigint,
  `url` string,
  `useragent` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket-name/events/'
TBLPROPERTIES (
  'classification'='json',
  'compressionType'='gzip',
  'typeOfData'='file'
)

Everything works fine and I get results with SELECT * FROM noproj

But if I create a partition projection:

CREATE EXTERNAL TABLE `proj` (
  `ts` bigint,
  `url` string,
  `useragent` string
)
PARTITIONED BY (
  `date_created` string,
  `hour` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket-name/events/'
TBLPROPERTIES (
  'classification'='json',
  'compressionType'='gzip',
  'typeOfData'='file',
  'projection.enabled' = 'true',
  'projection.date_created.type' = 'date',
  'projection.date_created.format' = 'yyyy/MM/dd',
  'projection.date_created.interval' = '1',
  'projection.date_created.interval.unit' = 'DAYS',
  'projection.date_created.range' = '2022/01/01, NOW',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template'='s3://bucket-name/events/${date_created}/${hour}/'
)

I get NO results at all I tried the following queries:

  • SELECT * FROM proj
  • SELECT * FROM proj WHERE date_created = '2022/11/08'
  • SELECT * FROM proj WHERE date_created = '2022/11/08' AND hour = '16'
  • SELECT * FROM proj WHERE date_created >= date('2022-01-01')

Note that the last query throws an error:

SYNTAX_ERROR: line 1:41: '>=' cannot be applied to varchar, date

Probably related to Athena Partition Projection for Date column vs. String

I tried the following but with no success:


Edit: I tried with a simpler projection and it also doesn't work:

CREATE EXTERNAL TABLE `year` (
  `ts` bigint,
  `url` string,
  `useragent` string
)
PARTITIONED BY (
  `year` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket-name/events/'
TBLPROPERTIES (
  'classification'='json',
  'compressionType'='gzip',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2022,2023',
  'projection.enabled' = 'true',
  'storage.location.template'='s3://bucket-name/events/${year}/'
)

Edit2: I also tried with no template location and

s3://bucket-name/events/year=2022/month=11/day=08/hour=16/file-2022-11-08-16-47-41-xxx.gz

Edit3: I also tried with parquet format and I have the exact same behaviour. Results with NO partition. No results with a projected partition

Yves M.
  • 29,855
  • 23
  • 108
  • 144

1 Answers1

0

Ok I find out the problem

I 'm using CloudFormation with YAML to setup AWS::KinesisFirehose::DeliveryStream

I was using >- multiline string notation:

Prefix: >-
  events/
  !{partitionKeyFromQuery:year}/
  !{partitionKeyFromQuery:month}/
  !{partitionKeyFromQuery:day}/
  !{partitionKeyFromQuery:hour}/

But I forget that the string gets joined with space characters!! And I didn't see the spaces in S3 console.. because a space is hard to spot.

This is the fix:

Prefix: events/!{partitionKeyFromQuery:year}/!{partitionKeyFromQuery:month}/!{partitionKeyFromQuery:day}/!{partitionKeyFromQuery:hour}/

If this help someone in the future please let me know that I have not suffered for nothing

Thank you fellow developers for your time.

Yves M.
  • 29,855
  • 23
  • 108
  • 144