
I am trying to add an empty column to my DataFrame df1 in PySpark.

The code I tried:

import pyspark.sql.functions as F
df1 = df1.withColumn("empty_column", F.lit(None))

But I get this error:

pyspark.sql.utils.AnalysisException: Parquet data source does not support void data type.

Can anyone help me with this?

ar_mm18

1 Answer


Instead of just F.lit(None), use it with a cast to a proper data type, e.g.:

F.lit(None).cast('string')
F.lit(None).cast('double')

When we add a literal null column, its data type is void:

from pyspark.sql import functions as F
spark.range(1).withColumn("empty_column", F.lit(None)).printSchema()
# root
#  |-- id: long (nullable = false)
#  |-- empty_column: void (nullable = true)

But when saving as a Parquet file, the void data type is not supported, so such columns must be cast to some other data type.

ZygD