I have a JSON column which can contain any number of key:value pairs, and I want to create new top-level columns for these pairs. For example, if I have this data:

A                                       B
"{\"C\":\"c\" , \"D\":\"d\"...}"        b

This is the output that I want:

B   C   D  ...
b   c   d

There are a few similar questions about splitting a column into multiple columns, but none of them work in this case. Can anyone please help? Thanks in advance!

gashu
  • Do all the JSON column's values have the same schema, or do they contain arrays? – Zhang Tong Mar 21 '17 at 01:39
  • The JSON column's values have different schemas and contain different key:value pairs. We can use json.loads to parse this column, since the value is in JSON format, but I want to know how I can create top-level columns while parsing this value. – gashu Mar 21 '17 at 02:55

1 Answer

You are looking for org.apache.spark.sql.functions.from_json: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@from_json(e:org.apache.spark.sql.Column,schema:String,options:java.util.Map[String,String]):org.apache.spark.sql.Column

Here's the Python code commit related to SPARK-17699: https://github.com/apache/spark/commit/fe33121a53384811a8e094ab6c05dc85b7c7ca87

Sample usage from the commit:

    >>> from pyspark.sql.functions import from_json
    >>> from pyspark.sql.types import *
    >>> data = [(1, '''{"a": 1}''')]
    >>> schema = StructType([StructField("a", IntegerType())])
    >>> df = spark.createDataFrame(data, ("key", "value"))
    >>> df.select(from_json(df.value, schema).alias("json")).collect()
    [Row(json=Row(a=1))]
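
To get the parsed keys as top-level columns, as asked in the question, one option is to expand the resulting struct with a star select. This is only a sketch, assuming Spark 2.1+; the column and key names below mirror the example data in the question:

    >>> from pyspark.sql.functions import from_json
    >>> from pyspark.sql.types import StructType, StructField, StringType
    >>> data = [('{"C": "c", "D": "d"}', "b")]
    >>> df = spark.createDataFrame(data, ("A", "B"))
    >>> json_schema = StructType([StructField("C", StringType()), StructField("D", StringType())])
    >>> # Parse the JSON column into a struct, then expand the struct into top-level columns
    >>> df.select("B", from_json(df.A, json_schema).alias("json")).select("B", "json.*").collect()
    [Row(B='b', C='c', D='d')]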
Garren S
  • Thanks, how can I add multiple columns at the same time, as I have multiple key:value pairs? – gashu Mar 21 '17 at 06:44
  • Try `df.select(from_json(df.value, schema), from_json(df.value2, schema2))` ? – Garren S Mar 21 '17 at 15:29
  • This may be related to your problem: https://issues.apache.org/jira/browse/SPARK-19595 - unfortunately if this is a limitation for you, you will have to wait until Spark 2.2 or create a custom solution now. – Garren S Mar 21 '17 at 17:31
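
For reference, a minimal sketch of the multi-column suggestion from the comment above (the `value2`/`schema2` names are hypothetical); each parsed struct can again be expanded with a star select:

    >>> from pyspark.sql.functions import from_json
    >>> from pyspark.sql.types import StructType, StructField, IntegerType
    >>> data = [('{"a": 1}', '{"b": 2}')]
    >>> df = spark.createDataFrame(data, ("value", "value2"))
    >>> schema = StructType([StructField("a", IntegerType())])
    >>> schema2 = StructType([StructField("b", IntegerType())])
    >>> # Parse each JSON column separately, then flatten both structs
    >>> parsed = df.select(from_json(df.value, schema).alias("j1"), from_json(df.value2, schema2).alias("j2"))
    >>> parsed.select("j1.*", "j2.*").collect()
    [Row(a=1, b=2)]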