2

I am trying to import data in the following format into a hive table

[
    {
      "identifier" : "id#1",
      "dataA" : "dataA#1"
    },
    {
      "identifier" : "id#2",
      "dataA" : "dataA#2"
    }
]

I have multiple files like this and I want each {} to form one row in the table. This is what I have tried:

CREATE EXTERNAL TABLE final_table(
    identifier STRING,
    dataA STRING
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION "s3://bucket/path_in_bucket/"

This is not creating a single row for each {} though. I have also tried

CREATE EXTERNAL TABLE final_table(
    rows ARRAY< STRUCT<
    identifier: STRING,
    dataA: STRING
    >>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION "s3://bucket/path_in_bucket/"

but this is not work either. Is there some way of specifying that the input as an array with each record being an item in the array to the hive query? Any suggestions on what to do?

shrewquest
  • 541
  • 1
  • 7
  • 22

2 Answers2

2

Here is what you need

Method 1: Adding name to the array

Data

{"data":[{"identifier" : "id#1","dataA" : "dataA#1"},{"identifier" : "id#2","dataA" : "dataA#2"}]}

SQL

SET hive.support.sql11.reserved.keywords=false;

CREATE EXTERNAL TABLE IF NOT EXISTS ramesh_test (
  data array<
    struct<
      identifier:STRING, 
      dataA:STRING
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'my_location';

SELECT rows.identifier,
       rows.dataA
  FROM ramesh_test d
LATERAL VIEW EXPLODE(d.data) d1 AS rows  ;

Output

enter image description here

Method 2 - No Changes to the data

Data

[{"identifier":"id#1","dataA":"dataA#1"},{"identifier":"id#2","dataA":"dataA#2"}]

SQL

CREATE EXTERNAL TABLE IF NOT EXISTS ramesh_raw_json (
  json STRING
)
LOCATION 'my_location';

SELECT get_json_object (exp.json_object, '$.identifier') AS Identifier,
       get_json_object (exp.json_object, '$.dataA') AS Identifier
  FROM ( SELECT json_object
           FROM ramesh_raw_json a
           LATERAL VIEW EXPLODE (split(regexp_replace(regexp_replace(a.json,'\\}\\,\\{','\\}\\;\\{'),'\\[|\\]',''), '\\;')) json_exploded AS json_object ) exp;

Output

enter image description here

Ramesh
  • 1,405
  • 10
  • 19
  • This requires modifying to {"data":}. I want to find out if there's anyway I can do this without having to modify the existing data. – shrewquest Nov 29 '17 at 03:05
  • @shrewquest - If you didn't want to name your array object then I think you will have to explode the array and use get_json_object to get the values. In my experience that's messy and slow in term of performance. My suggestion would be to create a temp table to concat the name to the array object. You could either do that in a UNIX script or in Hive itself using a a temp table. – Ramesh Dec 01 '17 at 08:53
  • @shrewquest - Updated the answer with method 2 which doesn't require changes to your data. hope this helps! – Ramesh Dec 01 '17 at 21:44
0

JSON records in data files must appear one per line, an empty line would produce a NULL record.

This json should work

{ "identifier" : "id#1", "dataA" : "dataA#1" }, { "identifier" : "id#2", "dataA" : "dataA#2" }

sandy kay
  • 115
  • 1
  • 14
  • The json I put up is formatted for readability. If you mean get rid of [ ] then I have 11672 files, is there some option to specify that each file is an array is what I'm asking – shrewquest Nov 28 '17 at 14:34
  • Can you provide a sample expected output for better understanding – sandy kay Nov 28 '17 at 14:37
  • expected output is a hive table with each item in the array as one row in the table – shrewquest Nov 28 '17 at 14:40
  • I had the same question in https://stackoverflow.com/questions/61969182/extract-json-in-array-in-aws-athena now 2 years later. Too bad none of the suggested solutions here created the results you wanted, so that I could have done the same. – Johnathan May 23 '20 at 09:01