I'm importing a collection from MongoDB to Spark.
val partitionDF = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("database", "db").option("collection", collectionName).load()
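For context, the SparkSession is configured roughly like this (the URI and app name are placeholders, not the exact values I use):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mongo-import")
  .config("spark.mongodb.input.uri", "mongodb://localhost:27017/db")
  .getOrCreate()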
For the data column in the resulting DataFrame, I get this type:
StructType(StructField(configurationName,NullType,true), ...
so at least some types in some columns are NullType.
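To see where these hide, something along these lines can be used to list the paths of all NullType fields (an illustrative sketch, not the exact code I ran):

import org.apache.spark.sql.types._

def nullTypePaths(dt: DataType, path: String = ""): Seq[String] = dt match {
  case s: StructType => s.fields.toSeq.flatMap(f => nullTypePaths(f.dataType, if (path.isEmpty) f.name else s"$path.${f.name}"))
  case a: ArrayType  => nullTypePaths(a.elementType, s"$path[]")
  case _: NullType   => Seq(path)
  case _             => Seq.empty
}

nullTypePaths(partitionDF.schema).foreach(println)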
As per Writing null values to Parquet in Spark when the NullType is inside a StructType, I try to fix the schema by replacing all NullTypes with StringTypes:
import org.apache.spark.sql.types._

// Rebuild the schema, recursively replacing every NullType with StringType.
def denullifyStruct(struct: StructType): StructType =
  StructType(struct.map { field =>
    StructField(field.name, denullify(field.dataType), field.nullable, field.metadata)
  })

def denullify(dt: DataType): DataType = dt match {
  case struct: StructType => denullifyStruct(struct)                                      // recurse into nested structs
  case array: ArrayType   => ArrayType(denullify(array.elementType), array.containsNull)  // recurse into array elements
  case _: NullType        => StringType                                                   // the actual replacement
  case other              => other                                                        // all other types stay as they are
}
val fixedDF = spark.createDataFrame(partitionDF.rdd, denullifyStruct(partitionDF.schema))
Issuing fixedDF.printSchema, I can see that no NullType exists in fixedDF's schema anymore. But when I try to save it to Parquet
fixedDF.write.mode("overwrite").parquet(partitionName + ".parquet")
I get the following error:
Caused by: com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a NullType (value: BsonString{value='117679.8'})
at com.mongodb.spark.sql.MapFunctions$.convertToDataType(MapFunctions.scala:214)
at com.mongodb.spark.sql.MapFunctions$.$anonfun$documentToRow$1(MapFunctions.scala:37)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
A NullType again!
The same issue occurs when I just count the number of rows: fixedDF.count().
Does Spark infer the schema again when writing to Parquet (or counting)? Is it possible to turn such inference off (or overcome this in some other way)?
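For example, I would have hoped that handing the fixed schema straight to the reader would bypass the inference, though I'm not sure the connector honours a user-supplied schema (sketch only):

val explicitDF = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .schema(denullifyStruct(partitionDF.schema))
  .option("database", "db")
  .option("collection", collectionName)
  .load()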