
I have the following JSON structure, which is stored as a string in a Hive table column:

[
  {"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
  {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
  {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}
]

Now I want to parse this into a table where each element of the outer array becomes a row and each field of the JSON objects becomes a column, while still keeping inner_list as a string:


| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |

I am aware of the trick of using a regex to create a custom separator, splitting on it, and then using lateral view explode, but in this case the nested arrays will also match the regex: Parse json arrays using HIVE

Any ideas on how to do this? I want to do this in raw Spark SQL if possible, with no UDFs or SerDes.

What I have tried:

  1. select explode(get_json_object(outer_list, "$[*]")) from wt_test;

This doesn't work: it says the input to function explode should be array or map type, not string.

  2. select explode(split(substr(outer_list, 2, length(outer_list)-2),",")) from wt_test;

This splits on every single comma, which is not what we want:

{"outer_id": 123000
"outer_field_1": blah
"inner_list": [{"inner_id": 456}
{"inner_id": 789}]}
... more rows ...
  • Could you maybe edit your desired output a little bit more? I am not sure what you expect. – Matt Oct 30 '20 at 08:49
  • @Matt edited for clarity – Lumo Woong Oct 30 '20 at 09:00
  • So inner_list is not important? – Matt Oct 30 '20 at 09:15
  • It is, but at this point I just want to explode the outer list; the whole requirement is kinda complicated... Let me edit the question to still keep the inner list as a string – Lumo Woong Oct 30 '20 at 09:21
  • Ahhh sorry, I didn't fully read it. So basically you have a table like this, right: {"a": 1, "outer_list":[{"outer_id": 123000, "outer_field_1": blah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123001, "outer_field_1": blahblah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123002, "outer_field_1": blahblahblah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, ]} – Matt Oct 30 '20 at 09:26
  • Yes, and the whole outer_list is stored as a string – Lumo Woong Oct 30 '20 at 09:28

1 Answer


Assuming I understood correctly, you have the following:

Input

{
   "some_id":1,
   "outer_list":'[{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]'
}

Desired Output:

| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |

First of all you need to parse the string. For that, define a schema:

schema = ArrayType(
    StructType([
        StructField('outer_id', IntegerType()),
        StructField('outer_field_1', StringType()),
        StructField('inner_list', StringType())
    ])
)

Notice this is the simple version, where inner_list is just kept as a string.
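
If you want to follow along, a minimal test DataFrame could look like this (a sketch; it assumes an existing SparkSession named spark and the column name outer_list, with the sample shortened to one outer element):

# Hypothetical setup so the steps below are reproducible
sample = '[{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]'
df = spark.createDataFrame([(1, sample)], ['some_id', 'outer_list'])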

Apply that schema to your DataFrame:

df = df.select(from_json('outer_list', schema).alias('test'))

Now you have a column with an array:

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|test                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[123000, blah, [{"inner_id":456},{"inner_id":789}]], [123001, blahblah, [{"inner_id":456},{"inner_id":789}]], [123002, blahblahblah, [{"inner_id":456},{"inner_id":789}]]]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
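
As a quick sanity check, printSchema should show the array-of-struct type produced by from_json:

# The parsed column is an array of structs; inner_list stays a plain string
df.printSchema()
# root
#  |-- test: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- outer_id: integer (nullable = true)
#  |    |    |-- outer_field_1: string (nullable = true)
#  |    |    |-- inner_list: string (nullable = true)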

Now you can explode this:

df.select(explode('test').alias('exploded')).select('exploded.*')

Which makes it:

+--------+-------------+-----------------------------------+
|outer_id|outer_field_1|inner_list                         |
+--------+-------------+-----------------------------------+
|123000  |blah         |[{"inner_id":456},{"inner_id":789}]|
|123001  |blahblah     |[{"inner_id":456},{"inner_id":789}]|
|123002  |blahblahblah |[{"inner_id":456},{"inner_id":789}]|
+--------+-------------+-----------------------------------+

While you are parsing outer_list anyway, you could parse inner_list the same way from the start (see the sketch below). But maybe try that yourself first; you have everything you need here.
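
For reference, a sketch of the fuller schema that parses inner_list as an array of structs right away (the same pattern, just nested):

# Sketch: parse inner_list into structs instead of keeping it as a string
schema_full = ArrayType(
    StructType([
        StructField('outer_id', IntegerType()),
        StructField('outer_field_1', StringType()),
        StructField('inner_list', ArrayType(
            StructType([StructField('inner_id', IntegerType())])
        ))
    ])
)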

Don't forget the imports:

from pyspark.sql.functions import explode, from_json
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

SQL version, if the input is given as a table json_test:

select exploded.* from (
    select explode(
        from_json(
            outer_list,
            "array<struct<outer_id:int,outer_field_1:string,inner_list:string>>"
        )
    ) as exploded
    from json_test
)
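
If you end up driving this from PySpark rather than the spark-sql CLI, the same statement can be run through spark.sql (a sketch; it assumes json_test has been registered, e.g. with df.createOrReplaceTempView('json_test')):

# Same query issued through the Python API
result = spark.sql("""
    select exploded.* from (
        select explode(
            from_json(
                outer_list,
                "array<struct<outer_id:int,outer_field_1:string,inner_list:string>>"
            )
        ) as exploded
        from json_test
    )
""")
result.show(truncate=False)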
  • Hello Matt, thanks for your answer. If I were using pyspark or scala this would all be easy, but I would like to do this in raw SQL, i.e. the "SELECT * FROM xxx" kind of queries you fire into the CLI – Lumo Woong Oct 30 '20 at 10:42
  • Hi @Matt, the input is given as a JSON string, so the conversion needs to be done with pure SQL as well – Lumo Woong Oct 31 '20 at 01:46
  • This is pure SQL, see the end of my answer where I use SQL to parse JSON. I am not sure what more you need – Matt Oct 31 '20 at 12:37
  • Hi @matt, the above SQL is not working for me. I wrote the following in my spark-sql query and it's throwing errors: ``` select from_json("[{\"outer_id\": 123000, \"outer_field_1\": \"blah\", \"inner_list\": [{\"inner_id\": 456}, {\"inner_id\": 789}]}]", "array>") ``` Seems like a syntax error with from_json: == SQL == array> -----^^^ ; line 1 pos 7 – Lumo Woong Oct 31 '20 at 13:49
  • I think we need to add a root key and we are good, see: https://stackoverflow.com/questions/41107835/pyspark-parse-a-column-of-json-strings – Lumo Woong Oct 31 '20 at 14:00