0

I want to parse a JSON request and create multiple columns out of it in pyspark as follows:

{
  "ID": "abc123",
  "device": "mobile",
  "Ads": [
    {
      "placement": "topright",
      "Adlist": [
        {
          "name": "ad1",
          "subtype": "placeholder1",
          "category": "socialmedia",
        },
        {
          "name": "ad2",
          "subtype": "placeholder2",
          "category": "media",
        },
        {
          "name": "ad3",
          "subtype": "placeholder3",
          "category": "printingpress",
        }
      ]
    },
    {
      "Placement": "bottomleft",
      "Adlist": [
        {
          "name": "ad4",
          "subtype": "placeholder4",
          "category": "socialmedia",
        },
        {
          "name": "ad5",
          "subtype": "placeholder5",
          "category": "media",
        },
        {
          "name": "ad6",
          "subtype": "placeholder6",
          "category": "printingpress",
        }
      ]
    }
  ]
}

I tried the following:

from pyspark import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

rdd = sc.parallelize([sample_json])
df = spark.read.option('multiline', "true").json(rdd)

df.printSchema()
root
 |-- Ads: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Adlist: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- category: string (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- subtype: string (nullable = true)
 |    |    |-- placement: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- device: string (nullable = true)

The resulting dataframe looks like this:

+--------------------+------+------+
|                 Ads|    ID|device|
+--------------------+------+------+
|[{[{socialmedia, ...|abc123|mobile|
+--------------------+------+------+

However, the output I am looking for is as follows:

Id    | device | placement | name  | subtype      | category      |
-------------------------------------------------------------------
abc123| mobile | topright  | ad1   | placeholder1 | socialmedia   |
abc123| mobile | topright  | ad2   | placeholder2 | media         |
abc123| mobile | topright  | ad3   | placeholder3 | printingpress |
abc123| mobile |bottomleft | ad4   | placeholder4 | socialmedia   |
abc123| mobile |bottomleft | ad5   | placeholder5 | media         |
abc123| mobile |bottomleft | ad6   | placeholder6 | printingpress |
starball
  • 20,030
  • 7
  • 43
  • 238
Gingerbread
  • 1,938
  • 8
  • 22
  • 36

2 Answers2

2

A combination of explode and ".*" will do the trick:

df = df.select(col("Id"), col("device"), explode(col("Ads")).alias("Ads")) \
    .select("Id", "device", "Ads.*") \
    .select(col("Id"), col("device"), explode(col("Adlist")).alias("Adlist"), col("placement")) \
    .select("Id", "device", "placement", "Adlist.*")
df.show()

+------+------+---------+-------------+----+------------+
|    Id|device|placement|     category|name|     subtype|
+------+------+---------+-------------+----+------------+
|abc123|mobile| topright|  socialmedia| ad1|placeholder1|
|abc123|mobile| topright|        media| ad2|placeholder2|
|abc123|mobile| topright|printingpress| ad3|placeholder3|
|abc123|mobile|     null|  socialmedia| ad4|placeholder4|
|abc123|mobile|     null|        media| ad5|placeholder5|
|abc123|mobile|     null|printingpress| ad6|placeholder6|
+------+------+---------+-------------+----+------------+

FYI:

  • explode() is a function that is used to transform a column of array into multiple rows
  • ".*" is used to tranform a struct column into columns of fields of that struxt
Abdennacer Lachiheb
  • 4,388
  • 7
  • 30
  • 61
1

To explode this there are a couple of methods, if the data structure is constant and won't change you can create a struct like

df2=df.withColumn("ads",from_json(col("ads"),struct = struct))

with the struct variable being the constant data structure of your ads' data, this would look something like

struct = schema = StructType([StructField("Placement", StringType()), StructField("OrderNumber", ArrayType(StringType()), ...., .....)], (you'll need to edit this to conform to your data)

Another method is star expansion, which you can use like

df = spark.read.option('multiline','true').json(ads_file_path).withColumn("ads", explode("ads")).select("ads.*", "*").drop("ads")

And then reusing this method if there are nested lists.

Hope this helps

Abdennacer Lachiheb
  • 4,388
  • 7
  • 30
  • 61