How to parse and transform a JSON string from Spark DataFrame rows in PySpark?
I'm looking for help with two tasks:
- parse the JSON string into a JSON struct (required output 1)
- transform the JSON string into the columns a, b and id (required output 2)
Background: I receive JSON strings via an API (jstr1, jstr2, ...) with a large number of rows, which are saved to a Spark DataFrame df. I can infer the schema for each row separately, but that is not a viable solution because it is very slow with so many rows. Every jstr has the same schema: the columns/keys a and b stay the same, only the id and the values in the columns change.
EDIT: blackbishop's solution of using a MapType schema works like a charm: schema = "map<string, array<struct<a:int,b:int>>>"
The question was extended to: how to transform a JSON string with multiple keys from Spark DataFrame rows in PySpark?
import pyspark.sql.functions as F
from pyspark.sql import Row

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'
jstr2 = '{"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}'
df = spark.createDataFrame([Row(json=jstr1), Row(json=jstr2)])

# schema is inferred from the first row only, so it contains only the key id_1
schema = F.schema_of_json(df.select(F.col("json")).take(1)[0].json)
df2 = df.withColumn('json', F.from_json(F.col('json'), schema))
df2.show()
Current output:
+--------------------+
| json|
+--------------------+
|[[[1, 2], [3, 4]]] |
| []|
+--------------------+
Required output 1:
+--------------------+-------+
| json | id |
+--------------------+-------+
|[[[1, 2], [3, 4]]] | id_1 |
|[[[5, 6], [7, 8]]] | id_2 |
+--------------------+-------+
Required output 2:
+---------+----------+-------+
| a | b | id |
+---------+----------+-------+
| 1 | 2 | id_1 |
| 3 | 4 | id_1 |
| 5 | 6 | id_2 |
| 7 | 8 | id_2 |
+---------+----------+-------+