I have a dataframe in which one of the string-type columns contains a JSON list of items. I want to explode that list and flatten its fields into columns of the parent dataframe. How can I do it?

Here is the code to create a sample dataframe:

from pyspark.sql import Row
from collections import OrderedDict

def convert_to_row(d: dict) -> Row:
    return Row(**OrderedDict(sorted(d.items())))

df = sc.parallelize([
    {"arg1": "first", "arg2": "John",
     "arg3": '[{"name" : "click", "datetime" : "1570103345039", "event" : "entry" }, {"name" : "drag", "datetime" : "1580133345039", "event" : "exit" }]'},
    {"arg1": "second", "arg2": "Joe",
     "arg3": '[{"name" : "click", "datetime" : "1670105345039", "event" : "entry" }, {"name" : "drop", "datetime" : "1750134345039", "event" : "exit" }]'},
    {"arg1": "third", "arg2": "Jane",
     "arg3": '[{"name" : "click", "datetime" : "1580105245039", "event" : "entry" }, {"name" : "drop", "datetime" : "1650134345039", "event" : "exit" }]'},
]).map(convert_to_row).toDF()

Running this code will create a dataframe as shown below:

+------+----+--------------------+
|  arg1|arg2|                arg3|
+------+----+--------------------+
| first|John|[{"name" : "click...|
|second| Joe|[{"name" : "click...|
| third|Jane|[{"name" : "click...|
+------+----+--------------------+

The arg3 column contains a JSON list that I want to explode into one row per item, with each item's fields as separate columns. I want the dataframe as follows:

arg1 | arg2 | arg3 | name | datetime | event

How can I achieve that?

Arvind Kandaswamy
  • Possible duplicate of [Pyspark: explode json in column to multiple columns](https://stackoverflow.com/questions/51070251/pyspark-explode-json-in-column-to-multiple-columns?noredirect=1&lq=1). – giser_yugang Oct 21 '19 at 07:39
  • Not exactly the same. One difference is in my third column that I want to expand, it is a list of items. I want to explode such that it returns the items in multiple rows. Then I can apply from_json as shown in the link you provided. I want to know how I can break the list of items to multiple rows. – Arvind Kandaswamy Oct 21 '19 at 10:30

1 Answer


You need to specify an array type in the schema passed to the from_json function:

from pyspark.sql.functions import explode, from_json

# DDL-style schema string: arg3 is a JSON array of structs
schema = 'array<struct<name:string,datetime:string,event:string>>'

# parse the JSON string into an array, explode it into one row per
# element, then flatten the struct fields into top-level columns
df.withColumn('data', explode(from_json('arg3', schema))) \
  .select(*df.columns, 'data.*') \
  .show()
+------+----+--------------------+-----+-------------+-----+
|  arg1|arg2|                arg3| name|     datetime|event|
+------+----+--------------------+-----+-------------+-----+
| first|John|[{"name" : "click...|click|1570103345039|entry|
| first|John|[{"name" : "click...| drag|1580133345039| exit|
|second| Joe|[{"name" : "click...|click|1670105345039|entry|
|second| Joe|[{"name" : "click...| drop|1750134345039| exit|
| third|Jane|[{"name" : "click...|click|1580105245039|entry|
| third|Jane|[{"name" : "click...| drop|1650134345039| exit|
+------+----+--------------------+-----+-------------+-----+
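
If you want to double-check the parsed structure before flattening (a quick sanity check using the same df and schema as above; the intermediate column name data is arbitrary), printSchema shows the array of structs:

from pyspark.sql.functions import from_json

df.withColumn('data', from_json('arg3', schema)).printSchema()
# root
#  |-- arg1: string (nullable = true)
#  |-- arg2: string (nullable = true)
#  |-- arg3: string (nullable = true)
#  |-- data: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- name: string (nullable = true)
#  |    |    |-- datetime: string (nullable = true)
#  |    |    |-- event: string (nullable = true)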

Note: if your Spark version does not support the simpleString format for the schema, try building it explicitly:

from pyspark.sql.types import ArrayType, StringType, StructType, StructField

schema = ArrayType(
    StructType([
        StructField('name', StringType()),
        StructField('datetime', StringType()),
        StructField('event', StringType()),
    ])
)
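
The explicit schema object drops in the same way as the DDL string above (a minimal sketch; only the schema definition changes):

from pyspark.sql.functions import explode, from_json

df.withColumn('data', explode(from_json('arg3', schema))) \
  .select(*df.columns, 'data.*') \
  .show()
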
jxc