
I am trying to add an empty column to my DataFrame df1 in PySpark.

The code I tried:

import pyspark.sql.functions as F
df1 = df1.withColumn("empty_column", F.lit(None))

But I get this error:

pyspark.sql.utils.AnalysisException: Parquet data source does not support void data type.

Can anyone help me with this?

ar_mm18

1 Answer


Instead of just F.lit(None), use it with a cast to a proper data type, e.g.:

F.lit(None).cast('string')
F.lit(None).cast('double')

When we add a literal null column, its data type is void:

from pyspark.sql import functions as F
spark.range(1).withColumn("empty_column", F.lit(None)).printSchema()
# root
#  |-- id: long (nullable = false)
#  |-- empty_column: void (nullable = true)

But when saving as a Parquet file, the void data type is not supported, so such columns must be cast to some other data type.

ZygD