I have a JSON column which can contain any number of key:value pairs, and I want to create new top-level columns for these pairs. For example, if I have this data:

A                                       B
"{\"C\":\"c\" , \"D\":\"d\"...}"        b

This is the output that I want:

B   C   D  ...
b   c   d

There are a few similar questions about splitting a column into multiple columns, but none of them work in this case. Can anyone please help? Thanks in advance!

gashu
  • Do all the JSON column's values have the same schema, or do they contain arrays? – Zhang Tong Mar 21 '17 at 01:39
  • The JSON column's values have different schemas and contain different key:value pairs. We can use json.loads to parse this column, since the value is in JSON format, but I want to know how I can create top-level columns while parsing this value. – gashu Mar 21 '17 at 02:55

1 Answer

You are looking for org.apache.spark.sql.functions.from_json: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@from_json(e:org.apache.spark.sql.Column,schema:String,options:java.util.Map[String,String]):org.apache.spark.sql.Column

Here's the Python code commit related to SPARK-17699: https://github.com/apache/spark/commit/fe33121a53384811a8e094ab6c05dc85b7c7ca87

Sample usage from the commit:

    >>> from pyspark.sql.functions import from_json
    >>> from pyspark.sql.types import *
    >>> data = [(1, '''{"a": 1}''')]
    >>> schema = StructType([StructField("a", IntegerType())])
    >>> df = spark.createDataFrame(data, ("key", "value"))
    >>> df.select(from_json(df.value, schema).alias("json")).collect()
    [Row(json=Row(a=1))]
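
To get the parsed keys as top-level columns, as asked in the question, one option is to expand the resulting struct with a star select. This is only a sketch, assuming Spark 2.1+; the column and key names below mirror the example data in the question:

    >>> from pyspark.sql.functions import from_json
    >>> from pyspark.sql.types import StructType, StructField, StringType
    >>> data = [('{"C": "c", "D": "d"}', "b")]
    >>> df = spark.createDataFrame(data, ("A", "B"))
    >>> json_schema = StructType([StructField("C", StringType()), StructField("D", StringType())])
    >>> # Parse the JSON column into a struct, then expand the struct into top-level columns
    >>> df.select("B", from_json(df.A, json_schema).alias("json")).select("B", "json.*").collect()
    [Row(B='b', C='c', D='d')]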
Garren S
  • Thanks, how can I add multiple columns at the same time, as I have multiple key:value pairs? – gashu Mar 21 '17 at 06:44
  • Try `df.select(from_json(df.value, schema), from_json(df.value2, schema2))` ? – Garren S Mar 21 '17 at 15:29
  • This may be related to your problem: https://issues.apache.org/jira/browse/SPARK-19595 - unfortunately if this is a limitation for you, you will have to wait until Spark 2.2 or create a custom solution now. – Garren S Mar 21 '17 at 17:31
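
For reference, a minimal sketch of the multi-column suggestion from the comment above (the `value2`/`schema2` names are hypothetical); each parsed struct can again be expanded with a star select:

    >>> from pyspark.sql.functions import from_json
    >>> from pyspark.sql.types import StructType, StructField, IntegerType
    >>> data = [('{"a": 1}', '{"b": 2}')]
    >>> df = spark.createDataFrame(data, ("value", "value2"))
    >>> schema = StructType([StructField("a", IntegerType())])
    >>> schema2 = StructType([StructField("b", IntegerType())])
    >>> # Parse each JSON column separately, then flatten both structs
    >>> parsed = df.select(from_json(df.value, schema).alias("j1"), from_json(df.value2, schema2).alias("j2"))
    >>> parsed.select("j1.*", "j2.*").collect()
    [Row(a=1, b=2)]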