I have a table with two columns (my_col and ID) and many records. Below are two sample records:

Record 1: my_col = {"XXX": ["123","456"],"YYY": ["246","135"]}, ID = A123

Record 2: my_col = {"ZZZ":["333"]}, ID = B222

I need to flatten this using PySpark into:

Key  Value  ID
XXX  123    A123
XXX  456    A123
YYY  246    A123
YYY  135    A123
ZZZ  333    B222

Can you please help with this?

FYI, I have already achieved the result in Hive:


select split(element,':')[0] key,
       e.value,
       id
from
(
SELECT 
 regexp_replace(e.element,'^\\{|"| *\\[|\\]|\\}$','') element, 
 t.id 
FROM
    input_df t
    lateral view explode(split(my_col,'(?<=\\]) *, *(?=\\")')) e as element 
)s lateral view explode(split(split(element,':')[1],',')) e as value 
Shan

1 Answer


The trick here is to cast the JSON column into a map and then explode the values until the result is flat:

import pyspark.sql.functions as f


input_df = spark.createDataFrame([
  ['A123', '{"XXX": ["123","456"],"YYY": ["246","135"]}'],
  ['B222', '{"ZZZ":["333"]}']
], schema='id string, my_col string')

output_df = (input_df
             .withColumn('entries', f.from_json('my_col', 'map<string, array<string>>'))
             .select('id', f.explode('entries'))
             .select('ID', 'Key', f.explode('value').alias('Value')))

output_df.show(truncate=False)
+----+---+-----+
|ID  |Key|Value|
+----+---+-----+
|A123|XXX|123  |
|A123|XXX|456  |
|A123|YYY|246  |
|A123|YYY|135  |
|B222|ZZZ|333  |
+----+---+-----+
Kafels
  • Getting ERROR from dataframe input_df. Error: raise ValueError("Could not parse datatype: %s" % s) ValueError: Could not parse datatype:ID String, json_col String – Shan Sep 13 '21 at 16:02
  • Which version are you using? I did it on Spark 3 – Kafels Sep 13 '21 at 18:19