
Is there any method by which I can create JSON from a Spark DataFrame, omitting the fields that are null?

Let's suppose I have a DataFrame:

+-------+----------------+
|   name|       hit_songs|
+-------+----------------+
|beatles|[help, hey jude]|
|  romeo|      [eres mia]|
| juliet|            null|
+-------+----------------+

I want to convert it into JSON like this:

[
  { "name": "beatles", "hit_songs": ["help", "hey jude"] },
  { "name": "romeo", "hit_songs": ["eres mia"] },
  { "name": "juliet" }
]

I don't want the hit_songs field in the JSON object if its value is null.

  • Check these: https://stackoverflow.com/questions/48261746/spark-get-only-columns-that-have-one-or-more-null-values and https://stackoverflow.com/questions/47830915/how-to-drop-multiple-column-names-given-in-a-list-from-spark-dataframe – Mohd Bilal May 16 '20 at 10:29

1 Answer


Use the to_json function for this case.


from pyspark.sql.functions import *

df = spark.createDataFrame(
    [("beatles", ["help", "hey jude"]), ("romeo", ["eres mia"]), ("juliet", None)],
    ["name", "hit_songs"]
)

#to_json drops null fields, so juliet's struct is serialised without hit_songs.
df.groupBy(lit(1)).\
    agg(collect_list(to_json(struct('name', 'hit_songs'))).alias("json")).\
    drop("1").\
    show(10, False)
#+------------------------------------------------------------------------------------------------------------------+
#|json                                                                                                              |
#+------------------------------------------------------------------------------------------------------------------+
#|[{"name":"beatles","hit_songs":["help","hey jude"]}, {"name":"romeo","hit_songs":["eres mia"]}, {"name":"juliet"}]|
#+------------------------------------------------------------------------------------------------------------------+

#Using the toJSON function instead; null fields are likewise omitted from each object.
df.groupBy(lit(1)).\
    agg(collect_list(struct('name', 'hit_songs')).alias("json")).\
    drop("1").\
    toJSON().\
    collect()
#[u'{"json":[{"name":"beatles","hit_songs":["help","hey jude"]},{"name":"romeo","hit_songs":["eres mia"]},{"name":"juliet"}]}']
  • I am sorry, this did not work, as I am also doing a repartition further on. My team and I have moved to a different approach. Can you help me with the other solution? Your previous solution worked, so I can rely on you for this too: https://stackoverflow.com/questions/61838445/add-json-object-field-to-a-json-array-field-in-the-dataframe-using-scala – Shruti Gusain May 16 '20 at 14:37
  • @ShrutiGusain, `repartition` doesn't cause any issues, and I'm not sure what error you are getting with the to_json function. – notNull May 16 '20 at 14:41
  • When I convert my DataFrame using toJSON, it shows the value as { name: "juliet", hit_songs: null }, which causes a null exception in the API of our application. – Shruti Gusain May 16 '20 at 14:45
  • The `to_json` function doesn't preserve nulls; if you have the `null` string in your DataFrame, convert it to a real null and then convert back to a JSON object. As in my example, `hit_songs` is `null` for juliet, so `hit_songs` is not part of the `juliet` JSON object. – notNull May 16 '20 at 14:49
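
A minimal sketch of that last suggestion, assuming (hypothetically) that hit_songs arrives as a string column in which the literal text "null" marks a missing value; this is not part of the original answer:

from pyspark.sql.functions import col, when, lit, to_json, struct

#Hypothetical input: hit_songs is a string column and the text "null"
#stands in for a missing value.
df2 = spark.createDataFrame(
    [("beatles", "help; hey jude"), ("juliet", "null")],
    ["name", "hit_songs"]
)

#Turn the string "null" into a real NULL, then serialise again;
#to_json then leaves hit_songs out of juliet's JSON object entirely.
cleaned = df2.withColumn(
    "hit_songs",
    when(col("hit_songs") == "null", lit(None)).otherwise(col("hit_songs"))
)
cleaned.select(to_json(struct("name", "hit_songs")).alias("json")).show(truncate=False)
#juliet's row now serialises as {"name":"juliet"}, with no hit_songs key.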