How to write null values of a PySpark DataFrame using toJSON(), as keys whose values are null are getting dropped

Question

When I write a Spark DataFrame to a JSON file, any column whose value is null is dropped from the JSON string. I would like to keep the key as it is, with a null value.

Below is the code, where the key "six" has the value None for the second row of the DataFrame:

from pyspark.sql import SparkSession
import json
from pyspark.sql.functions import col, collect_list, struct

# Create a Spark session
spark = SparkSession.builder.appName("DataFrameToJson").getOrCreate()

# Sample DataFrame
data = [
    {"first": "1", "second": "2", "third": "3", "fourth": "4", "fifth": 5, "six": 6},
    {"first": "11", "second": "22", "third": "33", "fourth": "44", "fifth": 55, "six": None}
]

# Create a DataFrame
final_df = spark.createDataFrame(data)
df_json = final_df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_json]
json_object = json.dumps(df_list_of_dicts)
with open("output.json", "w") as f:
    f.write(json_object)


The content of output.json after running the code is:

[
    {
        "fifth": 5,
        "first": "1",
        "fourth": "4",
        "second": "2",
        "six": 6,
        "third": "3"
    },
    {
        "fifth": 55,
        "first": "11",
        "fourth": "44",
        "second": "22",
        "third": "33"
    }
]

My expected output is:
[
    {
        "fifth": 5,
        "first": "1",
        "fourth": "4",
        "second": "2",
        "six": 6,
        "third": "3"
    },
    {
        "fifth": 55,
        "first": "11",
        "fourth": "44",
        "second": "22",
        "third": "33",
        "six": null
    }
]

score 0 · Answer 1 · answered Aug 18 '23 at 05:04


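Your question is similar to retain-keys-with-null-values-while-writing-json-in-spark. If you are using Spark 3, you only need to set spark.sql.jsonGenerator.ignoreNullFields to "false" when creating the Spark session; it defaults to true, which is why null fields are omitted from the JSON output.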

spark = SparkSession.builder.appName("DataFrameToJson").config("spark.sql.jsonGenerator.ignoreNullFields", "false").getOrCreate()

This will give you

df_json => 
[
'{"fifth":5,"first":"1","fourth":"4","second":"2","six":6,"third":"3"}', 
'{"fifth":55,"first":"11","fourth":"44","second":"22","six":null,"third":"33"}'
]
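If the config cannot be used (for example on a Spark version older than 3), one possible workaround is to put the missing keys back on the Python side after json.loads. The sketch below is only an illustration: the helper name restore_null_keys is hypothetical, and the two lists stand in for final_df.toJSON().collect() and final_df.columns so that it runs without a Spark session.

```python
import json


def restore_null_keys(json_rows, columns):
    """Parse JSON strings and re-add any column missing from a row as None.

    Hypothetical helper, not part of PySpark.
    """
    records = []
    for raw in json_rows:
        parsed = json.loads(raw)
        # dict.get returns None for keys that toJSON() left out
        records.append({name: parsed.get(name) for name in columns})
    return records


# Stand-in for final_df.toJSON().collect() without the config set
df_json = [
    '{"fifth":5,"first":"1","fourth":"4","second":"2","six":6,"third":"3"}',
    '{"fifth":55,"first":"11","fourth":"44","second":"22","third":"33"}',
]
# Stand-in for final_df.columns
columns = ["fifth", "first", "fourth", "second", "six", "third"]

records = restore_null_keys(df_json, columns)
json_object = json.dumps(records, indent=4)

with open("output.json", "w") as f:
    f.write(json_object)
```

This only restores top-level columns; null fields inside nested structs would still be missing, so the config is the simpler route when it is available.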

answered Aug 18 '23 at 05:04

Rahul Sahoo


Thank you, I am using Spark 3 and the solution worked. – defnoteg Aug 21 '23 at 02:36
Please mark this as the correct answer. Thank You. – Rahul Sahoo Aug 21 '23 at 04:29
