-2

I have json messages that I want to parse and store into relational db tables. The json messages have multiple levels of arrays. For example:

{
   "orderid": "123",
   "orderdate": "2021-12-23",
   "orderlines": [
      {
         "orderlinenum": 1,
         "itemid: "a123",
         "itemcost": 10.20
      },
      {
         "orderlinenum": 2,
         "itemid: "a999",
         "itemcost": 15.23
      }
   ]
}

I want to read this, parse, and store in 2 tables order_header and order_line. In this example my order header would have 1 record and I would have 2 order line records. The order lines need to have the order id as well in order to relate the lines to the header.

This is just an simple example. In practice there are multiple arrays, and some arrays have nested arrays. So I don't want to just normalize the whole thing. I really want to read the header level into a data frame / rdp, then the "line level" into another data frame, etc.

halfer
  • 19,824
  • 17
  • 99
  • 186
k50
  • 1
  • You can start by taking a look at this [one](https://stackoverflow.com/questions/51122967/spark-split-nested-json-into-rows) – blackbishop Dec 23 '21 at 16:40
  • 1
    Note that writing posts in all-lower-case isn't a great approach here - Stack Overflow is neither a chatroom, a forum, or social media. The community regards questions and answers like documentation, and some effort is expected. – halfer Jan 03 '22 at 00:44

1 Answers1

0
   val jsonSource = """
      {
    "orderid":"123",
    "orderdate":"2021-12-23",
    "orderlines":[
        {
          "orderlinenum":1,
          "itemid":"a123",
          "itemcost":10.20
        },
        {
          "orderlinenum":2,
          "itemid":"a999",
          "itemcost":15.23
        }
    ]
  } """

    val rdd1 = spark.sparkContext.makeRDD(jsonSource :: Nil)
    val df1 = spark.read.json(rdd1)
    // +----------+-------+-----------------------------------+
    // |orderdate |orderid|orderlines                         |
    // +----------+-------+-----------------------------------+
    // |2021-12-23|123    |[[10.2, a123, 1], [15.23, a999, 2]]|
    // +----------+-------+-----------------------------------+

    val orderLineTmpDF = df1.select(col("orderid"), col("orderlines"))
    .withColumn("orderlines", explode(col("orderlines")))
    // +-------+----------------+
    // |orderid|orderlines      |
    // +-------+----------------+
    // |123    |[10.2, a123, 1] |
    // |123    |[15.23, a999, 2]|
    // +-------+----------------+

    val orderDF = df1.select(col("orderid"), col("orderdate"))
    // +-------+----------+
    // |orderid|orderdate |
    // +-------+----------+
    // |123    |2021-12-23|
    // +-------+----------+

    val orderLineDF = orderLineTmpDF
      .withColumn("orderlinenum", col("orderlines").getItem("orderlinenum"))
      .withColumn("itemid", col("orderlines").getItem("itemid"))
      .withColumn("itemcost", col("orderlines").getItem("itemcost"))
      .drop("orderlines")
    
    // +-------+------------+------+--------+
    // |orderid|orderlinenum|itemid|itemcost|
    // +-------+------------+------+--------+
    // |123    |1           |a123  |10.2    |
    // |123    |2           |a999  |15.23   |
    // +-------+------------+------+--------+
mvasyliv
  • 1,214
  • 6
  • 10