I'm importing a collection from MongoDB to Spark.
val partitionDF = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("database", "db").option("collection", collectionName).load()
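For context, the SparkSession is configured roughly like this (the URI and app name are placeholders, not the exact values I use):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mongo-import")
  .config("spark.mongodb.input.uri", "mongodb://localhost:27017/db")
  .getOrCreate()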
For the data column in the resulting DataFrame, I get this type:
StructType(StructField(configurationName,NullType,true), ...
so at least some types in some columns are NullType.
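To see where these hide, something along these lines can be used to list the paths of all NullType fields (an illustrative sketch, not the exact code I ran):

import org.apache.spark.sql.types._

def nullTypePaths(dt: DataType, path: String = ""): Seq[String] = dt match {
  case s: StructType => s.fields.toSeq.flatMap(f => nullTypePaths(f.dataType, if (path.isEmpty) f.name else s"$path.${f.name}"))
  case a: ArrayType  => nullTypePaths(a.elementType, s"$path[]")
  case _: NullType   => Seq(path)
  case _             => Seq.empty
}

nullTypePaths(partitionDF.schema).foreach(println)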
As per Writing null values to Parquet in Spark when the NullType is inside a StructType, I try to fix the schema by replacing all NullTypes with StringTypes:
import org.apache.spark.sql.types._

// Rebuild the schema, recursively replacing every NullType with StringType.
def denullifyStruct(struct: StructType): StructType =
  StructType(struct.map { field =>
    StructField(field.name, denullify(field.dataType), field.nullable, field.metadata)
  })

def denullify(dt: DataType): DataType = dt match {
  case struct: StructType => denullifyStruct(struct)                                      // recurse into nested structs
  case array: ArrayType   => ArrayType(denullify(array.elementType), array.containsNull)  // recurse into array elements
  case _: NullType        => StringType                                                   // the actual replacement
  case other              => other                                                        // all other types stay as they are
}
val fixedDF = spark.createDataFrame(partitionDF.rdd, denullifyStruct(partitionDF.schema))
Issuing fixedDF.printSchema, I can see that no NullType exists in fixedDF's schema anymore. But when I try to save it to Parquet
fixedDF.write.mode("overwrite").parquet(partitionName + ".parquet")
I get the following error:
Caused by: com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a NullType (value: BsonString{value='117679.8'})
at com.mongodb.spark.sql.MapFunctions$.convertToDataType(MapFunctions.scala:214)
at com.mongodb.spark.sql.MapFunctions$.$anonfun$documentToRow$1(MapFunctions.scala:37)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
A NullType again!
The same issue occurs when I just count the number of rows: fixedDF.count().
Does Spark infer the schema again when writing to Parquet (or counting)? Is it possible to turn such inference off (or overcome this in some other way)?
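For example, I would have hoped that handing the fixed schema straight to the reader would bypass the inference, though I'm not sure the connector honours a user-supplied schema (sketch only):

val explicitDF = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .schema(denullifyStruct(partitionDF.schema))
  .option("database", "db")
  .option("collection", collectionName)
  .load()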