I'm importing a collection from MongoDB into Spark. Every document has a field 'data', which is itself a struct containing a field 'configurationName' (which is always null).
val partitionDF = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("database", "db")
  .option("collection", collectionName)
  .load()
For the 'data' column in the resulting DataFrame, I get this type:
StructType(StructField(configurationName,NullType,true), ...
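For reference, that type comes straight from the standard schema APIs:

// print the whole nested schema, or grab just the data column's type
partitionDF.printSchema()
println(partitionDF.schema("data").dataType)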
When I try to save the DataFrame as Parquet:
partitionDF.write.mode("overwrite").parquet(collectionName + ".parquet")
I get the following error:
AnalysisException: Parquet data source does not support struct<configurationName:null, ...
It looks like the problem is that I have that NullType buried in the 'data' column's type. I'm looking at "How to handle null values when writing to parquet from Spark", but it only shows how to solve this NullType problem for top-level columns.
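For top-level columns, the fix there boils down to something like this (a minimal sketch; I'm assuming StringType is an acceptable stand-in type for an all-null column):

import org.apache.spark.sql.types.{NullType, StringType}

// cast every top-level NullType column to a concrete type
val topLevelFixed = partitionDF.schema.fields.foldLeft(partitionDF) {
  case (df, field) if field.dataType == NullType =>
    df.withColumn(field.name, df(field.name).cast(StringType))
  case (df, _) => df
}

As far as I can tell, though, this never touches data.configurationName, because it only walks the top-level fields.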
So how do you solve this problem when a NullType is not at the top level? The only idea I have so far is to flatten the DataFrame completely (exploding arrays and so on), so that all the NullTypes would pop up at the top level. But then I would lose the original structure of the data, which I don't want to lose.
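To make that idea concrete, the flattening would look roughly like this (every column name here except data.configurationName is hypothetical):

import org.apache.spark.sql.functions.col

// hypothetical sketch of full flattening: one alias per nested leaf field,
// plus explode() for any array columns along the way
val flattened = partitionDF.select(
  col("data.configurationName").alias("data_configurationName")
  // ..., one such alias for every other nested leaf field
)

After that, the top-level cast above would apply and the Parquet write would succeed, but the nested structure is gone.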
Is there a better solution?