0

Converted dataframe(say child dataframe) into json using df.toJSON

After json conversion the schema looks like this :

root
 |-- value: string (nullable = true)

I used the following suggestion to get child dataframe into the intermediate parent schema/dataframe:

scala> parentDF.toJSON.select(struct($"value").as("data")).printSchema
root
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)

Now I still need to build the parentDF schema further to make it look like:

root
 |-- id
 |-- version 
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)

Q1) How can I build the id column using the id from value(i.e value.id needs to be represented as id)

Q2) I need to bring version from a different dataframe(say versionDF) where version is a constant(in all columns). Do I fetch one row from this versionDF to read value of version column and then populate it as literal in the parentDF ?

please help with any code snippets on this.

bali89
  • 3
  • 1
  • 2

2 Answers2

3

Instead of toJSON use to_json in select statement & select required columns along with to_json function.

Check below code.

val version =  // Get version value from versionDF
parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version"))

scala> parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version")).printSchema
root
 |-- id: integer (nullable = false)
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)
 |-- version: double (nullable = false)

Updated

scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).printSchema
root
 |-- id: integer (nullable = false)
 |-- data: string (nullable = true)
 |-- version: integer (nullable = false)


scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).show(false)
+---+------------------------------------------+-------+
|id |data                                      |version|
+---+------------------------------------------+-------+
|1  |{"value":{"id":1,"col1":"a1","col2":"b1"}}|1      |
|2  |{"value":{"id":2,"col1":"a2","col2":"b2"}}|1      |
|3  |{"value":{"id":3,"col1":"a3","col2":"b3"}}|1      |
+---+------------------------------------------+-------+

Update-1

scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).printSchema
root
 |-- id: integer (nullable = false)
 |-- data: string (nullable = true)
 |-- version: integer (nullable = false)


scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).show(false)
+---+--------------------------------+-------+
|id |data                            |version|
+---+--------------------------------+-------+
|1  |{"id":1,"col1":"a1","col2":"b1"}|1      |
|2  |{"id":2,"col1":"a2","col2":"b2"}|1      |
|3  |{"id":3,"col1":"a3","col2":"b3"}|1      |
+---+--------------------------------+-------+

Srinivas
  • 8,957
  • 2
  • 12
  • 26
  • This approach is good. However when i access the 'data' I see a jsonarray enclosure. need the data as jsonobject. is there a code adjustment which will achieve the same. – bali89 Jun 04 '20 at 04:16
  • can you please output how it is coming & how you want ? – Srinivas Jun 04 '20 at 04:17
  • Current content of data looks like this : [{"id":1,"col1":"a1","col2":"b1"}] Expected format of data: {"id":1,"col1":"a1","col2":"b1"} NOTE: no [ ] enclosure in the latter one - which is how its expected(jsonobject). Thanks for help. – bali89 Jun 04 '20 at 04:20
  • Any specific reason why you want object, trying to understand problem ?? – Srinivas Jun 04 '20 at 04:25
  • It is more of a requirement for this program and also it is natural(I think) to have data represented in jsonobject format by default. jsonarray represents more like a collection of jsonobjects - and 'data' field here is just a json object. – bali89 Jun 04 '20 at 04:34
  • One reason why you are seeing array is you asked move value one level down to data field. if you want only object you can directly use value – Srinivas Jun 04 '20 at 04:39
  • is it not possible to have data contain a nested jsonobject ? because that is what I'm looking for. – bali89 Jun 04 '20 at 05:54
  • you will get data like this - {"value":{"id":1,"cola":"aa","colb":"bb"}} is it ok ?? – Srinivas Jun 04 '20 at 05:59
  • ideally 'data' should be {"id":1,"cola":"aa","colb":"bb"}. feel free to add code for your suggestion though. – bali89 Jun 04 '20 at 06:21
1

Try this:

scala> val versionDF = List((1.0)).toDF("version")
versionDF: org.apache.spark.sql.DataFrame = [version: double]

scala> versionDF.show
+-------+
|version|
+-------+
|    1.0|
+-------+


scala> val version = versionDF.first.get(0)
version: Any = 1.0

scala>

scala> val childDF = List((1,"a1","b1"),(2,"a2","b2"),(3,"a3","b3")).toDF("id","col1","col2")
childDF: org.apache.spark.sql.DataFrame = [id: int, col1: string ... 1 more field]

scala> childDF.show
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|  a1|  b1|
|  2|  a2|  b2|
|  3|  a3|  b3|
+---+----+----+


scala>

scala> val parentDF =  childDF.toJSON.select(struct($"value").as("data")).withColumn("id",from_json($"data.value",childDF.schema).getItem("id")).withColumn("version",lit(version))
parentDF: org.apache.spark.sql.DataFrame = [data: struct<value: string>, id: int ... 1 more field]

scala> parentDF.printSchema
root
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- version: double (nullable = false)

scala> parentDF.show(false)
+----------------------------------+---+-------+
|data                              |id |version|
+----------------------------------+---+-------+
|[{"id":1,"col1":"a1","col2":"b1"}]|1  |1.0    |
|[{"id":2,"col1":"a2","col2":"b2"}]|2  |1.0    |
|[{"id":3,"col1":"a3","col2":"b3"}]|3  |1.0    |
+----------------------------------+---+-------+
Parvez Patel
  • 211
  • 2
  • 4
  • Thanks for the suggestion. why is the data being reflected as jsonarray when we perform a toJSON or to_json. any way we can get the 'data' represented a jsonobject ? – bali89 Jun 04 '20 at 04:18
  • `struct` function makes it look like array. e.g. 'childDF.select(struct(lit("col1val"),lit("col2val")).as("structColumn")).show | structColumn| +------------------+ |[col1val, col2val]|' If you dont want it, just use below: `childDF.toJSON.withColumnRenamed("value","data").show(false) +--------------------------------+ |data | +--------------------------------+ |{"id":1,"col1":"a1","col2":"b1"}| |{"id":2,"col1":"a2","col2":"b2"}| |{"id":3,"col1":"a3","col2":"b3"}| +--------------------------------+` – Parvez Patel Jun 04 '20 at 16:41