When I write a Spark DataFrame out as a JSON file, any column whose value is null is dropped from the JSON string. I would like to keep the key as it is, with a null value.
Below is the code.
In the sample data, the key "six" has the value None for the second row of the DataFrame.
from pyspark.sql import SparkSession
import json
from pyspark.sql.functions import col, collect_list, struct
# Create a Spark session
spark = SparkSession.builder.appName("DataFrameToJson").getOrCreate()
# Sample DataFrame
data = [
{"first": "1", "second": "2", "third": "3", "fourth": "4", "fifth": 5, "six": 6},
{"first": "11", "second": "22", "third": "33", "fourth": "44", "fifth": 55, "six": None}
]
# Create a DataFrame
final_df = spark.createDataFrame(data)
df_json = final_df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_json]
json_object = json.dumps(df_list_of_dicts)
with open("output.json", "w") as f:
    f.write(json_object)
The content of the file output.json after running the code is:
[
{
"fifth": 5,
"first": "1",
"fourth": "4",
"second": "2",
"six": 6,
"third": "3"
},
{
"fifth": 55,
"first": "11",
"fourth": "44",
"second": "22",
"third": "33"
}
]
My expected output is:
[
{
"fifth": 5,
"first": "1",
"fourth": "4",
"second": "2",
"six": 6,
"third": "3"
},
{
"fifth": 55,
"first": "11",
"fourth": "44",
"second": "22",
"third": "33",
"six": null
}
]
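One possible way to get the expected output, sketched here with the standard library only: since the key is already missing from the strings returned by toJSON(), the missing columns can be re-added as None after parsing, and json.dumps then writes None as null. The helper name fill_missing_keys is hypothetical, and the two hard-coded inputs are stand-ins for final_df.toJSON().collect() and final_df.columns. This handles top-level columns only, not nested structs. Alternatively, Spark 3.0 and later appear to expose a spark.sql.jsonGenerator.ignoreNullFields setting (and an ignoreNullFields JSON write option) that may avoid the drop at the source; that is worth checking against the documentation for the Spark version in use.

```python
import json


def fill_missing_keys(json_rows, columns):
    """Parse each JSON string and re-add any omitted column with the value None.

    Hypothetical helper: `json_rows` is a list of JSON strings (one per row),
    `columns` is the full list of column names the output should contain.
    """
    filled = []
    for raw in json_rows:
        record = json.loads(raw)
        # dict.get returns None for keys that were dropped from the JSON string
        filled.append({name: record.get(name) for name in columns})
    return filled


# Stand-ins (assumptions) for final_df.toJSON().collect() and final_df.columns
df_json = [
    '{"fifth":5,"first":"1","fourth":"4","second":"2","six":6,"third":"3"}',
    '{"fifth":55,"first":"11","fourth":"44","second":"22","third":"33"}',
]
columns = ["fifth", "first", "fourth", "second", "six", "third"]

records = fill_missing_keys(df_json, columns)

# json.dumps serializes Python None as JSON null, so the key is kept
json_object = json.dumps(records, indent=2)
print(json_object)
```

In the original script, the same helper would be called as fill_missing_keys(final_df.toJSON().collect(), final_df.columns) before writing json_object to output.json.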