I'm looking for a help, how to parse json string with multiple keys to json struct, see required output
.
Answer below shows how to transform JSON string with one Id
:
jstr1 = '{"id_1": \[{"a": 1, "b": 2}, {"a": 3, "b": 4}\]}'
How to parse and transform json string from spark data frame rows in pyspark
How to transform thousands of Ids in jstr1
, jstr2
, when number of Ids per JSON string change in each string.
Current Code:
jstr1 = """
{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}
"""
jstr2 = """
{"id_3": [{"a": 9, "b": 10}, {"a": 11, "b": 12}],
"id_4": [{"a": 12, "b": 14}, {"a": 15, "b": 16}],
"id_5": [{"a": 17, "b": 18}, {"a": 19, "b": 10}]}
"""
schema = "map<string, array<struct<a:int,b:int>>>"
df = sqlContext.createDataFrame([Row(json=jstr1),Row(json=jstr2)]) \
.withColumn('json', F.from_json(F.col('json'), schema))
output = df.withColumn("id", F.map_keys("json").getItem(0)) \
.withColumn("json", F.map_values("json").getItem(0))
output.show(truncate=False)
Current output:
+-------------------+----+
|json |id |
+-------------------+----+
|[[1, 2], [3, 4]] |id_1|
|[[9, 10], [11, 12]]|id_3|
+-------------------+----+
Required output:
+---------------------+------+
| json | id |
+---------------------+------+
|[[[1, 2], [3, 4]]] | id_1 |
|[[[5, 6], [7, 8]]] | id_2 |
|[[[9,10], [11,12]]] | id_3 |
|[[[13,14], [15,16]]] | id_4 |
|[[[17,18], [19,20]]] | id_5 |
+---------------------+------+
# NOTE: There is a large number of Ids in each JSON string
# so hard coded getItem(0), getItem(1) ... is not valid solution
...
|[[[1000,1001], [10002,1003 ]]] | id_100000 |
+-------------------------------+-----------+