1

Let's say I have a dataframe like this

val df = Seq((2012, 8, "Batman", 9.8), (2012, 8, "Hero", 8.7), (2012, 7, "Robot", 5.5), (2011, 7, "Git", 2.0)).toDF("year", "month", "title", "rating")

I can convert the whole row to json like this.

val jsonColumns = df.select("year", "month", "title", "rating").toJSON

How can I get only the month, year and title in json and year to be out of the json object, but still part of the dataframe (so i can use it as partition column)?

Srinivas
  • 2,010
  • 7
  • 26
  • 51

1 Answers1

2

How can I get only the month, year and title in json and year to be out of the json object, but still part of the dataframe (so i can use it as partition column)

You can use to_json and struct inbuilt functions as

import org.apache.spark.sql.functions._
df.select(to_json(struct(col("year"), col("month"), col("title"))).as("json"), col("year"))

You should get output as

+----------------------------------------+----+
|json                                    |year|
+----------------------------------------+----+
|{"year":2012,"month":8,"title":"Batman"}|2012|
|{"year":2012,"month":8,"title":"Hero"}  |2012|
|{"year":2012,"month":7,"title":"Robot"} |2012|
|{"year":2011,"month":7,"title":"Git"}   |2011|
+----------------------------------------+----+

Updated

to_json function would return a json string. If you require a json object then to_json function is not needed. Instead you can do the following

df.select(struct(col("year"), col("month"), col("title")).as("json"), col("year")).toJSON
Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
  • Maharajan, thanks. Now when I write the data into a table with strict object it is not possible. cause it is only a string, not a json object. can you please tell me how do i convert this object and load it in a struct column. – Srinivas Feb 19 '18 at 13:50
  • @Srinivas, for that case don't use to_json function. just use struct as `df.select(struct(col("year"), col("month"), col("title")).as("json"), col("year")).toJSON` and you should be fine. And its not Maharajan, it Maharjan. And I am from Nepal. – Ramesh Maharjan Feb 20 '18 at 12:10
  • @Maharjan, thanks for the reply. When I try this all the double quotes are backslashes like \”year\” and for this reason the hive table cannot read the data. Apologies for typing your name wrong, I assumed you were South Indian through your first name. – Srinivas Feb 20 '18 at 13:40
  • I don't use hive so I cannot tell you the extact solution. When I write to a text file its working fine. did you try removing .toJSON – Ramesh Maharjan Feb 20 '18 at 13:50
  • @Maharjan, you are right, it is the same in my machine as well. It is hive’s inability to decipher \. I will try to write something like a json parser. – Srinivas Feb 20 '18 at 14:14
  • Thanks for your help – Srinivas Feb 20 '18 at 14:15