
I am trying to write a JSON file using Spark. Some keys have null as their value. These show up just fine in the Dataset, but when I write the file, the keys get dropped. How do I ensure they are retained?

Code to write the file:

ddp.coalesce(20).write().mode("overwrite").json("hdfs://localhost:9000/user/dedupe_employee");

part of JSON data from source:

"event_header": {
        "accept_language": null,
        "app_id": "App_ID",
        "app_name": null,
        "client_ip_address": "IP",
        "event_id": "ID",
        "event_timestamp": null,
        "offering_id": "Offering",
        "server_ip_address": "IP",
        "server_timestamp": 1492565987565,
        "topic_name": "Topic",
        "version": "1.0"
    }

Output:

"event_header": {
        "app_id": "App_ID",
        "client_ip_address": "IP",
        "event_id": "ID",
        "offering_id": "Offering",
        "server_ip_address": "IP",
        "server_timestamp": 1492565987565,
        "topic_name": "Topic",
        "version": "1.0"
    }

In the above example keys accept_language, app_name and event_timestamp have been dropped.
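The default behavior is equivalent to filtering out None-valued keys before serializing each record. A plain-Python sketch of the effect (not Spark code, just an illustration):

```python
import json

# A record shaped like part of the event_header struct above.
record = {"accept_language": None, "app_id": "App_ID", "app_name": None,
          "client_ip_address": "IP"}

# Spark's JSON writer effectively omits null-valued fields by default.
without_nulls = {k: v for k, v in record.items() if v is not None}
print(json.dumps(without_nulls))
# {"app_id": "App_ID", "client_ip_address": "IP"}
```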

Vaishak Suresh

5 Answers


Apparently, Spark does not provide any option to handle nulls here, so the following custom solution should work.

import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper

case class EventHeader(
  accept_language: String,
  app_id: String,
  app_name: String,
  client_ip_address: String,
  event_id: String,
  event_timestamp: String,
  offering_id: String,
  server_ip_address: String,
  server_timestamp: Long,
  topic_name: String,
  version: String)

val ds = Seq(EventHeader(null,"App_ID",null,"IP","ID",null,"Offering","IP",1492565987565L,"Topic","1.0")).toDS()

val ds1 = ds.mapPartitions(records => {
  // Jackson serializes the case class fields directly, keeping null values.
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  records.map(mapper.writeValueAsString(_))
})

ds1.coalesce(1).write.text("hdfs://localhost:9000/user/dedupe_employee")

This will produce output like:

{"accept_language":null,"app_id":"App_ID","app_name":null,"client_ip_address":"IP","event_id":"ID","event_timestamp":null,"offering_id":"Offering","server_ip_address":"IP","server_timestamp":1492565987565,"topic_name":"Topic","version":"1.0"}
m-bhole

If you are on Spark 3, you can set the following configuration:

spark.sql.jsonGenerator.ignoreNullFields false
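A sketch of applying this at runtime on an existing session (assumes `spark` is an active SparkSession and `ddp` is the DataFrame from the question):

```python
# Assumes an active SparkSession named `spark` (Spark 3+).
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false")

# Subsequent JSON writes will now keep null-valued keys.
ddp.coalesce(20).write.mode("overwrite").json("hdfs://localhost:9000/user/dedupe_employee")
```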
mani_nz

Since Spark 3, ignoreNullFields is an option you can set when writing a DataFrame out as a JSON file.

If you need Spark 2 (specifically PySpark 2.4.6), you can try converting the DataFrame to an RDD of Python dicts, and then call saveAsTextFile to write the output to HDFS. The following example may help.

cols = ddp.columns
ddp_ = ddp.rdd
# Convert each Row to a plain dict so that null columns are kept as None.
ddp_ = ddp_.map(lambda row: dict([(c, row[c]) for c in cols]))
ddp_.repartition(1).saveAsTextFile(your_hdfs_file_path)

This should produce an output file like:

{'accept_language': None, 'app_id': '123', ...}
{'accept_language': None, 'app_id': '456', ...}

What's more, if you want to replace Python None with JSON null, you will need to dump each dict to JSON before saving:

import json

ddp_ = ddp_.map(lambda row: json.dumps(row, ensure_ascii=False))
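To see the difference json.dumps makes, compare it with the dict's str() form on a single record (plain Python, independent of Spark):

```python
import json

row = {"accept_language": None, "app_id": "123"}

# str() of a dict is Python repr, not JSON: None stays None, quotes are single.
print(str(row))         # {'accept_language': None, 'app_id': '123'}

# json.dumps emits valid JSON: None becomes null, keys get double quotes.
print(json.dumps(row))  # {"accept_language": null, "app_id": "123"}
```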
lzw

Since Spark 3, and if you are using the class DataFrameWriter

https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html#json-java.lang.String-

(same applies for pyspark)

https://spark.apache.org/docs/3.0.0-preview/api/python/_modules/pyspark/sql/readwriter.html

its json method has an ignoreNullFields option that defaults to None, meaning it falls back to the spark.sql.jsonGenerator.ignoreNullFields session setting, which is true by default.

So just set this option to false:

ddp.coalesce(20).write().mode("overwrite").option("ignoreNullFields", "false").json("hdfs://localhost:9000/user/dedupe_employee")
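The same setting can also be supplied application-wide at launch instead of per write; a sketch (the script name here is hypothetical):

```shell
spark-submit \
  --conf spark.sql.jsonGenerator.ignoreNullFields=false \
  your_job.py
```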
bloodrootfc

To retain null values when converting to JSON, set this config option:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local[1]")
    # Keep null-valued fields when writing JSON.
    .config("spark.sql.jsonGenerator.ignoreNullFields", "false")
    .getOrCreate()
)
Mcmil