Array of JSON in Athena is read incorrectly and can't be unnested

Question

I have column called uf that contains an array of JSON objects. Here is a mockup:

[
    {"type": "browserId", "name": "", "value": "unknown"},
    {"type": "campaign", "name": "", "value": "om_227dec0082a5"},
    {"type": "custom", "name": "2351350529", "value": "10148"},
    {"type": "custom", "name": "9501713387", "value": "true"},
    {"type": "custom", "name": "9517735577", "value": "true"},
    {"type": "custom", "name": "9507402548", "value": "true"},
    {"type": "custom", "name": "9733902068", "value": "true"}
]

I'm trying to get this as child records but for some reason I can't find the right way to unnest it first. Then I noticed that my whole array is wrapped into another JSON object..

This is where I'm at:

I tried simple select and noticed that the result is:

[{type=[{"type": "browserId", "name": "", "value": "ff"}, name=null, value=null}]

The definition for this column is as follows:

{
    "Name": "uf",
    "Type": "array<struct<type:string,name:string,value:string>>"
}

Is the definition incorrect and that's why I get my whole array wrapped in another json array?

-- edit Here is the example of my csv file (tab delimited). Spent last two days trying to see if it's something about JSON that makes Glue not recognise column as array of JSON but I created a new column with simple array of JSON that was correctly assigned as array<struct but after querying I was getting exactly the same problem as above

timestamp   project_id  campaign_id experiment_id   variation_id    layer_holdback  audience_names  end_user_id uuid    session_id  snippet_revision    user_ip user_agent  user_engine user_engine_version referer global_holdback event_type  event_name  uf  active_views    event_features  event_metrics   event_uuid
1570326511  74971132    11089500404 11097730080 11078120202 false   []  oeu1535997971348r0.4399811351004357     AUTO    6540    5.91.170.0  Mozilla/5.0 (Linux; Android 7.0; SAMSUNG SM-G925F Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/9.2 Chrome/67.0.3396.87 Mobile Safari/537.36  js  0.128.0 https://www.zavamed.com/uk/account/ false   view_activated  10832783364 [{"type": "browserId", "name": "", "value": "unknown"}, {"type": "device", "name": "", "value": "mobile"}, {"type": "device_type", "name": "", "value": "phone"}, {"type": "referrer", "name": "", "value": "https:\/\/www.google.co.uk\/"}, {"type": "source_type", "name": "", "value": "campaign"}, {"type": "currentTimestamp", "name": "", "value": "-1631518596"}, {"type": "offset", "name": "", "value": "-60"}]    []  []  []  4926a5f1-bbb5-4553-9d0b-b26f773fa0f4

Could you provide a full table DDL definition, i.e. `SHOW CREATE TABLE`, as well as a sample of you whole JSON file and not just for a column `uf`? When I tried to reproduce your JSON file, I got the same schema for `uf` column by using Glue crawler and was able to `UNNEST`. A single line in my file looked like `{"prop1": 1,"uf":[{"type": "1browserId", "name": "", "value": "1unknown"},{"type": "2browserId", "name": "", "value": "2unknown"}]}` — Ilya Kisil, Nov 11 '19 at 11:08
why dont you flatten this array via spark then use it in athena ? — sumitya, Nov 15 '19 at 08:56
@syadav don't know Spark (yet), hence trying to do it with the things I know — Moseleyi, Nov 18 '19 at 16:00

Ilya Kisil · Answer 1 · 2019-11-18T17:52:41.563

I uploaded a sample csv file onto S3 with the content you provided. Then I ran a glue crawler on it. Here is a table schema I ended up with:

CREATE EXTERNAL TABLE `question_58765672`(
  `timestamp` bigint, 
  `project_id` bigint, 
  `campaign_id` bigint, 
  `experiment_id` bigint, 
  `variation_id` bigint, 
  `layer_holdback` boolean, 
  `audience_names` array<string>, 
  `end_user_id` string, 
  `uuid` string, 
  `session_id` string, 
  `snippet_revision` bigint, 
  `user_ip` string, 
  `user_agent` string, 
  `user_engine` string, 
  `user_engine_version` string, 
  `referer` string, 
  `global_holdback` boolean, 
  `event_type` string, 
  `event_name` bigint, 
  `uf` string, 
  `active_views` array<string>, 
  `event_features` array<string>, 
  `event_metrics` array<string>, 
  `event_uuid` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://__S3_PATH_IN_MY_BUCKET__/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='some-crawler', 
  'areColumnsQuoted'='false', 
  'averageRecordSize'='553', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'='\t', 
  'objectCount'='1', 
  'recordCount'='2', 
  'sizeKey'='1109', 
  'skip.header.line.count'='1', 
  'typeOfData'='file')

As you can see it identified column uf as string, which I wasn't really surprised about. In order to unnest this column, I had to manually cast it to the correct type ARRAY(JSON):

SELECT
    "timestamp",
    _unnested_column
FROM
    "stackoverflow"."question_58765672",
    UNNEST( CAST(json_parse(uf) AS ARRAY(JSON)) ) AS t(_unnested_column)

Result:

    timestamp   _unnested_column
1   1570326511  {"name":"","type":"browserId","value":"unknown"}
2   1570326511  {"name":"","type":"device","value":"mobile"}
3   1570326511  {"name":"","type":"device_type","value":"phone"}
4   1570326511  {"name":"","type":"referrer","value":"https://www.google.co.uk/"}
5   1570326511  {"name":"","type":"source_type","value":"campaign"}
6   1570326511  {"name":"","type":"currentTimestamp","value":"-1631518596"}
7   1570326511  {"name":"","type":"offset","value":"-60"}

Then I thought of creating a athena views, where, column uf would be casted correctly:

CREATE OR REPLACE VIEW question_58765672_v1_json AS
SELECT
    CAST(json_parse(uf) AS ARRAY(JSON)) as uf
    -- ALL other columns from your table
FROM 
    "stackoverflow"."question_58765672"

However, I got the following error:

Invalid column type for column uf: Unsupported Hive type: json

My guess, is that schema for column uf is either too complicated for for glue crawler in order to correctly identify it or just simply not supported by the used Serde, i.e. 'org.apache.hadoop.mapred.TextInputFormat' or 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'.

for some reason, cast to json works in select, but doesn't work if part of view. for example: select cast(5 as json) "test" works while |create view bla as select cast(5 as json) "test" | doesn't work — Sasa Ninkovic, Mar 27 '20 at 13:10
Athena uses Presto to run `Select` queries, but Hive to define tables and create views (see [FAQ](https://aws.amazon.com/athena/faqs/?nc1=h_ls)). As opposed to [Presto](https://docs.aws.amazon.com/athena/latest/ug/extracting-data-from-JSON.html), [Hive has no Json data type](https://docs.aws.amazon.com/athena/latest/ug/data-types.html). I think this is the reason you got the error "_Unsupported Hive type: json_". — Henrik Koberg, May 09 '22 at 07:17

Array of JSON in Athena is read incorrectly and can't be unnested

1 Answers1