How to transform JSON to relational database tables using Spark

Question

I have json messages that I want to parse and store into relational db tables. The json messages have multiple levels of arrays. For example:

{
   "orderid": "123",
   "orderdate": "2021-12-23",
   "orderlines": [
      {
         "orderlinenum": 1,
         "itemid: "a123",
         "itemcost": 10.20
      },
      {
         "orderlinenum": 2,
         "itemid: "a999",
         "itemcost": 15.23
      }
   ]
}

I want to read this, parse, and store in 2 tables order_header and order_line. In this example my order header would have 1 record and I would have 2 order line records. The order lines need to have the order id as well in order to relate the lines to the header.

This is just an simple example. In practice there are multiple arrays, and some arrays have nested arrays. So I don't want to just normalize the whole thing. I really want to read the header level into a data frame / rdp, then the "line level" into another data frame, etc.

You can start by taking a look at this [one](https://stackoverflow.com/questions/51122967/spark-split-nested-json-into-rows) — blackbishop, Dec 23 '21 at 16:40
Note that writing posts in all-lower-case isn't a great approach here - Stack Overflow is neither a chatroom, a forum, or social media. The community regards questions and answers like documentation, and some effort is expected. — halfer, Jan 03 '22 at 00:44

score 0 · Answer 1 · answered Dec 24 '21 at 06:11

   val jsonSource = """
      {
    "orderid":"123",
    "orderdate":"2021-12-23",
    "orderlines":[
        {
          "orderlinenum":1,
          "itemid":"a123",
          "itemcost":10.20
        },
        {
          "orderlinenum":2,
          "itemid":"a999",
          "itemcost":15.23
        }
    ]
  } """

    val rdd1 = spark.sparkContext.makeRDD(jsonSource :: Nil)
    val df1 = spark.read.json(rdd1)
    // +----------+-------+-----------------------------------+
    // |orderdate |orderid|orderlines                         |
    // +----------+-------+-----------------------------------+
    // |2021-12-23|123    |[[10.2, a123, 1], [15.23, a999, 2]]|
    // +----------+-------+-----------------------------------+

    val orderLineTmpDF = df1.select(col("orderid"), col("orderlines"))
    .withColumn("orderlines", explode(col("orderlines")))
    // +-------+----------------+
    // |orderid|orderlines      |
    // +-------+----------------+
    // |123    |[10.2, a123, 1] |
    // |123    |[15.23, a999, 2]|
    // +-------+----------------+

    val orderDF = df1.select(col("orderid"), col("orderdate"))
    // +-------+----------+
    // |orderid|orderdate |
    // +-------+----------+
    // |123    |2021-12-23|
    // +-------+----------+

    val orderLineDF = orderLineTmpDF
      .withColumn("orderlinenum", col("orderlines").getItem("orderlinenum"))
      .withColumn("itemid", col("orderlines").getItem("itemid"))
      .withColumn("itemcost", col("orderlines").getItem("itemcost"))
      .drop("orderlines")
    
    // +-------+------------+------+--------+
    // |orderid|orderlinenum|itemid|itemcost|
    // +-------+------------+------+--------+
    // |123    |1           |a123  |10.2    |
    // |123    |2           |a999  |15.23   |
    // +-------+------------+------+--------+

How to transform JSON to relational database tables using Spark

1 Answers1