0

I am working on an FHIR resource where I am getting a JSON data like below:

{
    "appointmentRef": "Appointment/12213#4200",
    "encounterLengh": "2",
    "billingAccount": "savingsAccount",
    "hospitalization": "{\"preAdmissionIdentifierSystem\":\"https://system123445.html\",\"preAdmissionIdentifierValue\":\"pqr\",\"origin\":\"hospital\",\"admitSourceCode\":\"outp\",\"admitSourceReason\":\"some thing\",\"eid\":200,\"destination\":\"hospital\"}",
    "resourceType": "Encounter",
    "priority": "abc",
    "status": "triaged",
    "eid": "200",
    "subject": "Patient/435"
}

So, previously for the attributes which are on the root level like appointmentRef and etc..they also had "\" on the R.H.S which I was able to remove it by my code. However, as it can be seen from the data above that for the nested attributes my code isn't working.

rowList.groupBy(row => row.key).foreach(rowList => {
        import com.google.gson.{Gson, JsonObject}
        val map: Map[String, String] = mutable.Map()
        rowList._2.foreach(row => {
          LOGGER.debug(s"row == $row")
          if (Utility.isBlank(row.jsonElementTag)) {
            val convertedObject = new Gson().fromJson(row.value, classOf[JsonObject])
            val itr = convertedObject.entrySet().iterator()
            while (itr.hasNext) {
              val next = itr.next()
              val value = next.getValue.getAsString
              val key = next.getKey
              LOGGER.debug(s"key-- $key value --$value")
              map.put(key, value)
            }
          }
          else {
            val convertedObject = new Gson().fromJson(row.value, classOf[JsonObject])
            LOGGER.debug(s"convertedObject  == $convertedObject")
            if (null != map.get(row.jsonElementTag).getOrElse(null)) {
              LOGGER.debug("map.get(row.jsonElementTag).get === "+row.jsonElementTag +" "+map.get(row.jsonElementTag).get)
              var array: JsonArray = new JsonArray
              val mapElement = new Gson().fromJson(map.get(row.jsonElementTag).get, classOf[JsonObject])
              array.add(mapElement)
              array.add(convertedObject)
              map.put(row.jsonElementTag, array.toString)
            }
            else {
              map.put(row.jsonElementTag, convertedObject.toString)
            }
          }
        })

I am just taking the rows from the data frame and iterating over the rows, taking it as a string, and putting it in key-value pairs. The if loop will run for the parent level attributes and the else-if loop will be executed for the nested attributes.

I even tried the simpler way of replace("\","") but it didn't work. So, how do I remove the "\" from the nested attributes?

My expected output is there should be no "\" in my nested JSON attributes.

whatsinthename
  • 1,828
  • 20
  • 59

2 Answers2

1

hospitalization column is of type string & It contains json object. To extract or convert string to json, prepare schema as per data in that column.

Check below code.

scala> import org.apache.spark.sql.types._                                                                                                                                                         
import org.apache.spark.sql.types._                                                                                                                                                                
                                                                                                                                                                                                   
scala> val schema = DataType.fromJson("""{"type":"struct","fields":[{"name":"admitSourceCode","type":"string","nullable":true,"metadata":{}},{"name":"admitSourceReason","type":"string","nullable"
:true,"metadata":{}},{"name":"destination","type":"string","nullable":true,"metadata":{}},{"name":"eid","type":"long","nullable":true,"metadata":{}},{"name":"origin","type":"string","nullable":tr
ue,"metadata":{}},{"name":"preAdmissionIdentifierSystem","type":"string","nullable":true,"metadata":{}},{"name":"preAdmissionIdentifierValue","type":"string","nullable":true,"metadata":{}}]}""").
asInstanceOf[StructType]                                                                                                                                                                           
scala> df.withColumn("hospitalization",from_json($"hospitalization",schema)).printSchema                                                                                                           
root                                                                                                                                                                                               
 |-- appointmentRef: string (nullable = true)                                                                                                                                                      
 |-- billingAccount: string (nullable = true)                                                                                                                                                      
 |-- eid: string (nullable = true)                                                                                                                                                                 
 |-- encounterLengh: string (nullable = true)                                                                                                                                                      
 |-- hospitalization: struct (nullable = true)                                                                                                                                                     
 |    |-- admitSourceCode: string (nullable = true)                                                                                                                                                
 |    |-- admitSourceReason: string (nullable = true)                                                                                                                                              
 |    |-- destination: string (nullable = true)                                                                                                                                                    
 |    |-- eid: long (nullable = true)                                                                                                                                                              
 |    |-- origin: string (nullable = true)                                                                                                                                                         
 |    |-- preAdmissionIdentifierSystem: string (nullable = true)                                                                                                                                   
 |    |-- preAdmissionIdentifierValue: string (nullable = true)                                                                                                                                    
 |-- priority: string (nullable = true)                                                                                                                                                            
 |-- resourceType: string (nullable = true)                                                                                                                                                        
 |-- status: string (nullable = true)                                                                                                                                                              
 |-- subject: string (nullable = true)                                                                                                                                                             
scala> df.withColumn("hospitalization",from_json($"hospitalization",schema)).show(false)                                                                                                           
+----------------------+--------------+---+--------------+---------------------------------------------------------------------------+--------+------------+-------+-----------+                   
|appointmentRef        |billingAccount|eid|encounterLengh|hospitalization                                                            |priority|resourceType|status |subject    |                   
+----------------------+--------------+---+--------------+---------------------------------------------------------------------------+--------+------------+-------+-----------+                   
|Appointment/12213#4200|savingsAccount|200|2             |[outp, some thing, hospital, 200, hospital, https://system123445.html, pqr]|abc     |Encounter   |triaged|Patient/435|                   
+----------------------+--------------+---+--------------+---------------------------------------------------------------------------+--------+------------+-------+-----------+                   

Update

Created small helper class to extract or convert json without schema.

  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.expressions._
  import org.json4s.JsonDSL._
  import org.json4s._
  import org.json4s.jackson.JsonMethods._

  val append = udf((rowId: Long,json: String) => {
    compact(render(Map("rowId" -> parse(rowId.toString),"data" ->parse(json))))
  })

  implicit class DFHelper(df: DataFrame) {
    import df.sparkSession.implicits._

    def parseJson = df.sparkSession.read.option("multiLine","true").json(df.map(_.getString(0)))

    //Convert string to json object or array of json object
    def extract(column: Column) = {

      val updatedDF = df.withColumn("rowId",row_number().over(Window.orderBy(lit(1))))
      val parsedDF = updatedDF.filter(column.isNotNull)
        .select(append($"rowid",column).as("row"))
        .parseJson

      updatedDF.join(
        parsedDF.select($"rowId",$"data".as(column.toString())),
        updatedDF("rowId") === parsedDF("rowId"),
        "left"
      )
          .drop("rowId") // Deleting added rowId column.
    }
  }
scala> df.extract($"hospitalization").printSchema()

root
 |-- appointmentRef: string (nullable = true)
 |-- billingAccount: string (nullable = true)
 |-- eid: string (nullable = true)
 |-- encounterLengh: string (nullable = true)
 |-- hospitalization: string (nullable = true)
 |-- priority: string (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- status: string (nullable = true)
 |-- subject: string (nullable = true)
 |-- hospitalization: struct (nullable = true)
 |    |-- admitSourceCode: string (nullable = true)
 |    |-- admitSourceReason: string (nullable = true)
 |    |-- destination: string (nullable = true)
 |    |-- eid: long (nullable = true)
 |    |-- encounterLengh: string (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- preAdmissionIdentifierSystem: string (nullable = true)
 |    |-- preAdmissionIdentifierValue: string (nullable = true)
scala> df.extract($"hospitalization").show(false)
+----------------------+--------------+---+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+------------+-------+-----------+------------------------------------------------------------------------------+
|appointmentRef        |billingAccount|eid|encounterLengh|hospitalization                                                                                                                                                                                                                        |priority|resourceType|status |subject    |hospitalization                                                               |
+----------------------+--------------+---+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+------------+-------+-----------+------------------------------------------------------------------------------+
|Appointment/12213#4200|savingsAccount|200|1             |{"encounterLengh": "1","preAdmissionIdentifierSystem":"https://system123445.html","preAdmissionIdentifierValue":"pqr","origin":"hospital","admitSourceCode":"outp","admitSourceReason":"some thing","eid":200,"destination":"hospital"}|abc     |Encounter   |triaged|Patient/435|[outp, some thing, hospital, 200, 1, hospital, https://system123445.html, pqr]|
+----------------------+--------------+---+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+------------+-------+-----------+------------------------------------------------------------------------------+

Srinivas
  • 8,957
  • 2
  • 12
  • 26
  • I appreciate the efforts @Srinivas but the thing is we don't have any fixed schema with us. Like, there can be multiple nested attributes in the single data, attributes length can change, data types might be different and so on..That's why I need something generic over here. – whatsinthename Jul 15 '20 at 07:36
  • I have updated code, where you just have to pass df.extract($"hospitalization").. It will extract or convert string to json object. Check once. – Srinivas Jul 15 '20 at 09:14
  • @Srinivas , is there any easy way to convert String column which contains json objects or array or struct objects to flat or new columns with "Spark SQL" instead of PySpark or any other easy ways? – sai saran Dec 06 '22 at 18:13
0

Perhaps this is helpful -

Load the test data provided

val data =
      """
        |{
        |    "appointmentRef": "Appointment/12213#4200",
        |    "encounterLengh": "2",
        |    "billingAccount": "savingsAccount",
        |    "hospitalization": "{\"preAdmissionIdentifierSystem\":\"https://system123445.html\",\"preAdmissionIdentifierValue\":\"pqr\",\"origin\":\"hospital\",\"admitSourceCode\":\"outp\",\"admitSourceReason\":\"some thing\",\"eid\":200,\"destination\":\"hospital\"}",
        |    "resourceType": "Encounter",
        |    "priority": "abc",
        |    "status": "triaged",
        |    "eid": "200",
        |    "subject": "Patient/435"
        |}
      """.stripMargin

    val ds = Seq(data).toDF()
    ds.show(false)
    ds.printSchema()

    /**
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |
      * {
      * "appointmentRef": "Appointment/12213#4200",
      * "encounterLengh": "2",
      * "billingAccount": "savingsAccount",
      * "hospitalization": "{\"preAdmissionIdentifierSystem\":\"https://system123445.html\",\"preAdmissionIdentifierValue\":\"pqr\",\"origin\":\"hospital\",\"admitSourceCode\":\"outp\",\"admitSourceReason\":\"some thing\",\"eid\":200,\"destination\":\"hospital\"}",
      * "resourceType": "Encounter",
      * "priority": "abc",
      * "status": "triaged",
      * "eid": "200",
      * "subject": "Patient/435"
      * }
      * |
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      *
      * root
      * |-- value: string (nullable = true)
      */

Replace \ with ''(empty string)

    ds.withColumn("value", translate($"value", "\\", ""))
      .show(false)

    /**
      * +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
      * +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |
      * {
      * "appointmentRef": "Appointment/12213#4200",
      * "encounterLengh": "2",
      * "billingAccount": "savingsAccount",
      * "hospitalization": "{"preAdmissionIdentifierSystem":"https://system123445.html","preAdmissionIdentifierValue":"pqr","origin":"hospital","admitSourceCode":"outp","admitSourceReason":"some thing","eid":200,"destination":"hospital"}",
      * "resourceType": "Encounter",
      * "priority": "abc",
      * "status": "triaged",
      * "eid": "200",
      * "subject": "Patient/435"
      * }
      * |
      * +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      */
Som
  • 6,193
  • 1
  • 11
  • 22