18

I have an athena table which I did not create or manage, but can query. one of the fields is a struct type. for the sake of the example let's suppose it looks like this:

my_field struct<a:string,
                b:string,
                c:struct<d:string,e:string>
                >

Now, I know how to query specific fields within this struct. But in one of my queries I need to extract the complete struct. so I just use:

select my_field from my_table

and the result looks like a string:

{a=aaa, b=bbb, c={d=ddd, e=eee}}

I want to get the result as a json string:

{"a":"aaa", "b":"bbb","c":{"d":"ddd", "e":"eee"}}

this string will then be processed by another application, this is why i need it in json format.

How can I achieve this?

EDIT: Better still, is there a way to query the struct in a way that flattens it? so the result would look like:

a   |   b   |   c.d  |  c.e   |
-------------------------------
aaa |   bbb |   ddd  |  eee   |
amit
  • 3,332
  • 6
  • 24
  • 32

3 Answers3

5

You can directly reference nested fields with a parent_field.child_field notation. Try:

SELECT
  my_field,
  my_field.a,
  my_field.b,
  my_field.c.d,
  my_field.c.e
FROM 
  my_table
James
  • 11,721
  • 2
  • 35
  • 41
  • 6
    I said I know that :-) the problem is making such a query for a very big struct, and especially in cases when you don't know in advance its structure, you just know you want to make it flat. – amit Mar 11 '18 at 05:53
  • 1
    Hello @amit, did you had luck making it flat without knowing its structure? – nariver1 May 08 '20 at 20:37
0

We can convert the structs from athena output to objects by Post processing. Below script may help

Assuming sample string received for the nested object

   {description=Check the Primary key count of TXN_EVENT table in Oracle, datastore_order=1, zone=yellow, aggregation_type=count, updatedcount=0, updatedat=[2021-06-09T02:03:20.243Z]}

It can be parsed using the help of this npm package athena-struct-parser package.

  1. Nodejs -- https://www.npmjs.com/package/athena-struct-parser
  2. Python -- AWS Athena export array of structs to JSON

Sample Code

var parseStruct =require('athena-struct-parser') ;
var str = '{description=Check the Primary key count of TXN_EVENT table in Oracle, datastore_order=1, zone=yellow, aggregation_type=count, updatedcount=0, updatedat=[2021-06-09T02:03:20.243Z]}'
var parseObj = parseStruct(str)
console.log(parseObj);

Result Parsed Output

{
  description: 'Check the Primary key count of TXN_EVENT table in Oracle',
  datastore_order: '1',
  zone: 'yellow',
  aggregation_type: 'count',
  updatedcount: '0',
  updatedat: [ '2021-06-09T02:03:20.004Z' ]
}
Ranjithkumar MV
  • 804
  • 8
  • 10
0

Gave a response to a similar question: AWS Athena export array of structs to JSON

I used a simple approach to get around the struct -> json Athena limitation. I created a second table where the json columns were saved as raw strings. Using presto json and array functions I was able to query the data and return the valid json string to my program:

--Array transform functions too
select 
  json_extract_scalar(dd, '$.timestamp') as timestamp,
  transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.time')) as arr_stats_time,
  transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.mean')) as arr_stats_mean,
  transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.var')) as arr_stats_var
from 
(select '{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}' as dd);

I know the query will take longer to execute but there are ways to optimize.

zipate
  • 166
  • 1
  • 4