I have a table with two columns (my_col and ID) and many records. Below are two sample records:

Record 1: my_col = {"XXX": ["123","456"],"YYY": ["246","135"]}, ID = A123

Record 2: my_col = {"ZZZ":["333"]}, ID = B222

I need to flatten this using PySpark into:

Key  Value  ID
XXX  123    A123
XXX  456    A123
YYY  246    A123
YYY  135    A123
ZZZ  333    B222

Can you please help with this?

FYI, I have already achieved the result in Hive:


select split(element,':')[0] key,
       e.value,
       id
from
(
SELECT 
 regexp_replace(e.element,'^\\{|"| *\\[|\\]|\\}$','') element, 
 t.id 
FROM
    input_df t
    lateral view explode(split(my_col,'(?<=\\]) *, *(?=\\")')) e as element 
)s lateral view explode(split(split(element,':')[1],',')) e as value 
Shan

1 Answer


The trick here is to cast the JSON column into a map and then explode the values until the result is flat:

import pyspark.sql.functions as f


input_df = spark.createDataFrame([
  ['A123', '{"XXX": ["123","456"],"YYY": ["246","135"]}'],
  ['B222', '{"ZZZ":["333"]}']
], schema='id string, my_col string')

output_df = (input_df
             .withColumn('entries', f.from_json('my_col', 'map<string, array<string>>'))
             .select('id', f.explode('entries'))
             .select('ID', 'Key', f.explode('value').alias('Value')))

output_df.show(truncate=False)
+----+---+-----+
|ID  |Key|Value|
+----+---+-----+
|A123|XXX|123  |
|A123|XXX|456  |
|A123|YYY|246  |
|A123|YYY|135  |
|B222|ZZZ|333  |
+----+---+-----+
Kafels
  • Getting ERROR from dataframe input_df. Error: raise ValueError("Could not parse datatype: %s" % s) ValueError: Could not parse datatype:ID String, json_col String – Shan Sep 13 '21 at 16:02
  • Which version are you using? I did it on Spark 3 – Kafels Sep 13 '21 at 18:19