I have an S3 bucket archiving JSON objects via Kinesis Firehose. Each bucket object can contain multiple JSON objects that can vary in the schema.
Bucket structure
bucket
└── archive
└── 2021
└── 04
├── 11
| ├── 10
| | └── object-1
└── 14
└── 09
└── object-2
First object - s3://bucket/archive/2021/04/11/10/object-1
{"id":1,"timestamp":"2021-04-11T10:01:20Z","nestedData":{"moreNesting":{"someData":"value"}}}
{"id":2,"timestamp":"2021-04-11T12:00:31Z","someDifferentlyNestedData":{"differentNesting":{"someOtherData":"value"}}}
Second object - s3://bucket/archive/2021/04/14/09/object-2
{"id":3,"timestamp":"2021-04-14T09:02:30Z","noNestedData":"value"}
I have set up a Glue Crawler to create the Data Catalogue Table out of my data.
Data Catalogue Table
I am using AWS Athena to query that data. When querying, I can retrieve all the columns of the target JSON object with the following query:
All columns query
SELECT *
FROM archive
WHERE nesteddata.morenesting.somedata = 'value';
All columns query result
I can also retrieve the S3 object path of the JSON object with the following query:
S3 path query
SELECT "$path"
FROM archive
WHERE nesteddata.morenesting.somedata = 'value';
S3 path query results
Question
My question is, how can I retrieve the original JSON object via a query, i.e. how can I have a query result that returns something like the following: