How to parse and transform a JSON string from Spark DataFrame rows in PySpark?
I'm looking for help with two tasks:
- parse the JSON string into a JSON struct (required output 1)
- transform the JSON string into the columns a, b and id (required output 2)
Background: I receive JSON strings via an API (jstr1, jstr2, ...) with a large number of rows, which are saved to a Spark DataFrame df. I can infer the schema for each row separately, but that is not a viable solution because it is very slow with so many rows. Every jstr has the same schema: the columns/keys a and b stay the same, only the id and the values in the columns change.
EDIT: blackbishop's solution of using a MapType schema works like a charm: schema = "map<string, array<struct<a:int,b:int>>>"
The question was extended to: how to transform a JSON string with multiple keys from Spark DataFrame rows in PySpark?
import pyspark.sql.functions as F
from pyspark.sql import Row

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'
jstr2 = '{"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}'
df = spark.createDataFrame([Row(json=jstr1), Row(json=jstr2)])

# schema is inferred from the first row only, so it contains only the key id_1
schema = F.schema_of_json(df.select(F.col("json")).take(1)[0].json)
df2 = df.withColumn('json', F.from_json(F.col('json'), schema))
df2.show()
Current output:
+--------------------+
| json|
+--------------------+
|[[[1, 2], [3, 4]]] |
| []|
+--------------------+
Required output 1:
+--------------------+-------+
| json | id |
+--------------------+-------+
|[[[1, 2], [3, 4]]] | id_1 |
|[[[5, 6], [7, 8]]] | id_2 |
+--------------------+-------+
Required output 2:
+---------+----------+-------+
| a | b | id |
+---------+----------+-------+
| 1 | 2 | id_1 |
| 3 | 4 | id_1 |
| 5 | 6 | id_2 |
| 7 | 8 | id_2 |
+---------+----------+-------+