
The data looks like this:

+-----------+-----------+-----------------------------+
|         id|      point|                         data|
+-----------+-----------+-----------------------------+
|        abc|          6|{"key1":"124", "key2": "345"}|
|        dfl|          7|{"key1":"777", "key2": "888"}|
|        4bd|          6|{"key1":"111", "key2": "788"}|
+-----------+-----------+-----------------------------+

I am trying to break it into the following format.

+-----------+-----------+-----------+-----------+
|         id|      point|       key1|       key2|
+-----------+-----------+-----------+-----------+
|        abc|          6|        124|        345|
|        dfl|          7|        777|        888|
|        4bd|          6|        111|        788|
+-----------+-----------+-----------+-----------+

The explode function explodes the dataframe into multiple rows, but that is not the desired solution.

Note: this solution does not answer my question: PySpark "explode" dict in column


6 Answers


As long as you are using Spark version 2.1 or higher, pyspark.sql.functions.from_json should get you your desired result, but you first need to define the required schema:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Schema of the JSON stored in the "data" column
schema = StructType(
    [
        StructField('key1', StringType(), True),
        StructField('key2', StringType(), True)
    ]
)

# Parse the JSON string into a struct, then flatten the struct into top-level columns
df.withColumn("data", from_json("data", schema))\
    .select(col('id'), col('point'), col('data.*'))\
    .show()

which should give you

+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc|    6| 124| 345|
|dfl|    7| 777| 888|
|4bd|    6| 111| 788|
+---+-----+----+----+
– pault
  • You should be able to use something like the following to extract the schema of the JSON from the data field: `schema = spark.read.json(df.rdd.map(lambda row: row.data)).schema` – Simon Peacock Jan 15 '20 at 13:28
  • Is there any way to do this without supplying a schema? In the context of Spark streaming jobs, the above schema extraction is not an option @SimonPeacock; writing down the complete schema is messy (to say the least) and also quite inflexible, as I want additional fields to appear without having to adapt and restart the whole streaming job – denise Jul 02 '20 at 09:26
  • Get the schema using `df.schema`, and don't forget to declare all data types as `StringType()`, otherwise it may give `null` values for other data types as well as for StringTypes – Ankit Agrawal Jun 11 '21 at 16:57
  • In case you want to select all the rest of the DF columns and also expand the JSON column, use the following: `df2 = df.select("*", col("data.*"))` – Shrikant Prabhu Jul 23 '21 at 23:36
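
Building on Simon Peacock's comment, a minimal sketch of inferring the schema from the JSON strings themselves instead of writing it out by hand, assuming the question's df with a string data column:

from pyspark.sql.functions import from_json, col

# Infer the schema by letting Spark read the JSON strings in "data"
# (this runs a job over the data, so consider sampling on large datasets)
inferred_schema = spark.read.json(df.rdd.map(lambda row: row.data)).schema

df.withColumn("data", from_json("data", inferred_schema)) \
    .select(col("id"), col("point"), col("data.*")) \
    .show()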

As suggested by @pault, the data field is a string field. Since the keys are the same (i.e. 'key1', 'key2') in the JSON string across rows, you can also use json_tuple() (per the documentation, this function is new in version 1.6):

from pyspark.sql import functions as F

# json_tuple extracts the listed keys from the JSON string column in one pass
df.select('id', 'point', F.json_tuple('data', 'key1', 'key2').alias('key1', 'key2')).show()

Below is my original post, which is most likely WRONG if the original table is from df.show(truncate=False) and thus the data field is NOT a Python data structure.

Since you had exploded the data into rows, I assumed the data column was a Python data structure instead of a string:

from pyspark.sql import functions as F

# getItem()/bracket access works when "data" is a MapType or StructType column
df.select('id', 'point', F.col('data').getItem('key1').alias('key1'), F.col('data')['key2'].alias('key2')).show()
– jxc
  • I don't think this works in this case- you need a `MapType()` column to use `getItem()` but it looks like it's a string here. – pault Jun 27 '18 at 20:50
  • the OP mentioned the results had been exploded into multiple rows, this does not sound like a string field. – jxc Jun 27 '18 at 20:51
  • *"The explode function explodes the dataframe into multiple rows."* sounds like OP is stating a fact, rather than what they have tried. Also, if it were a `MapType()` it would not display as shown in the post. – pault Jun 27 '18 at 20:53
  • Thanks, I think you might be right, but I think it can be much simpler when the keys are constant in the JSON strings. – jxc Jun 28 '18 at 01:46
  • this works. I did not know about `json_tuple` - it's much easier than defining the schema. – pault Jun 28 '18 at 14:54
  • Struggled for a day with looping, this is really cool. – Arpit Sisodia May 07 '19 at 05:01

As mentioned by @jxc, json_tuple should work fine if you are not able to define the schema beforehand and only need to deal with a single level of JSON string. I think it's more straightforward and easier to use. Strangely, I didn't find anyone else mention this function before.

In my use case, the original dataframe schema is StructType(List(StructField(a,StringType,true))), with the JSON string column shown as:

+---------------------------------------+
|a                                      |
+---------------------------------------+
|{"k1": "v1", "k2": "2", "k3": {"m": 1}}|
|{"k1": "v11", "k3": "v33"}             |
|{"k1": "v13", "k2": "23"}              |
+---------------------------------------+

Expand the JSON fields into new columns with json_tuple:

from pyspark.sql import functions as F

df = df.select(F.col('a'), 
    F.json_tuple(F.col('a'), 'k1', 'k2', 'k3') \
    .alias('k1', 'k2', 'k3'))

df.schema
df.show(truncate=False)

The documentation doesn't say much about it, but at least in my use case the new columns extracted by json_tuple are StringType, and it only extracts a single level of the JSON string.

StructType(List(StructField(k1,StringType,true),StructField(k2,StringType,true),StructField(k3,StringType,true)))

+---------------------------------------+---+----+-------+
|a                                      |k1 |k2  |k3     |
+---------------------------------------+---+----+-------+
|{"k1": "v1", "k2": "2", "k3": {"m": 1}}|v1 |2   |{"m":1}|
|{"k1": "v11", "k3": "v33"}             |v11|null|v33    |
|{"k1": "v13", "k2": "23"}              |v13|23  |null   |
+---------------------------------------+---+----+-------+
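
Since json_tuple only goes one level deep, a minimal sketch for reaching a nested value such as m inside k3 (assuming the same df as above) is get_json_object with a JSON path:

from pyspark.sql import functions as F

# get_json_object takes a JSON path expression, so it can reach below the first level;
# the result comes back as a string (or null where the path does not exist)
df.select(
    F.json_tuple(F.col('a'), 'k1', 'k2').alias('k1', 'k2'),
    F.get_json_object(F.col('a'), '$.k3.m').alias('k3_m')
).show(truncate=False)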
– Brandon

This works for my use case:

from pyspark.sql.functions import from_json

# Infer the JSON schema from the string column itself
data1 = spark.read.parquet(path)
json_schema = spark.read.json(data1.rdd.map(lambda row: row.json_col)).schema

# Parse the JSON string into a struct column named "data"
data2 = data1.withColumn("data", from_json("json_col", json_schema))

# Build the final column list: every original column plus each field inside "data"
col1 = data2.columns
col1.remove("data")
col2 = data2.select("data.*").columns
append_str = "data."
col3 = [append_str + val for val in col2]
col_list = col1 + col3

# Select the flattened columns and drop the raw JSON string
data3 = data2.select(*col_list).drop("json_col")
– dragonachu

All credit to Shrikant Prabhu.

You can simply use SQL:

SELECT id, point, data.*
FROM original_table

This way the schema of the new table will adapt if the data changes, and you won't have to change anything in your pipeline.
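
A minimal sketch of running this from PySpark, assuming the JSON in data has already been parsed into a struct (for example with from_json as in the accepted answer) and that original_table is simply the name registered here:

# Register the parsed DataFrame under the name used in the SQL query
# (df_parsed is a hypothetical, already-parsed DataFrame)
df_parsed.createOrReplaceTempView("original_table")

# data.* expands every field of the struct column into its own column
spark.sql("SELECT id, point, data.* FROM original_table").show()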

– Yannick Widmer

With this approach you just need to set the name of the column holding the JSON content; there is no need to set up the schema, as everything is derived automatically.

json_col_name = 'data'

# Field names inside the (already parsed) JSON column
keys = df.select(f"{json_col_name}.*").columns
jsonFields = [f"{json_col_name}.{key} {key}" for key in keys]

# Keep every other column as-is and expand the JSON fields alongside them
main_fields = [key for key in df.columns if key != json_col_name]
df_new = df.selectExpr(main_fields + jsonFields)
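
Note that df.select(f"{json_col_name}.*") only works once data is a struct column; if it is still a JSON string as in the question, a minimal sketch of the prerequisite parse step (reusing the schema-inference idea from the earlier answers) would be:

from pyspark.sql.functions import from_json

# Hypothetical prerequisite: parse the JSON string so that "data.*" can be expanded
json_schema = spark.read.json(df.rdd.map(lambda row: row.data)).schema
df = df.withColumn('data', from_json('data', json_schema))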
– MariaMsu