
When saving a PySpark DataFrame with a new column added via the withColumn function, that column's nullability changes from false to true.

Version info: Python 3.7.3 / Spark 2.4.0-cdh6.1.1

>>> l = [('Alice', 1)]
>>> df = spark.createDataFrame(l)
>>> df.printSchema()
root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)

>>> from pyspark.sql.functions import lit
>>> df = df.withColumn('newCol', lit('newVal'))
>>> df.printSchema()
root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)
 |-- newCol: string (nullable = false)

>>> df.write.saveAsTable('default.withcolTest', mode='overwrite')

>>> spark.sql("select * from default.withcolTest").printSchema()
root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)
 |-- newCol: string (nullable = true)
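
In case it helps to reproduce this outside the shell, here is a minimal, self-contained sketch (assuming a Hive-enabled SparkSession; the table name is the same as above) that reads the flag programmatically instead of eyeballing printSchema output:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Assumes a Hive-enabled SparkSession so saveAsTable works as in the session above
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.createDataFrame([('Alice', 1)]).withColumn('newCol', lit('newVal'))
# lit() produces a non-null literal, so the new field is reported as not nullable
print(df.schema['newCol'].nullable)   # False

df.write.saveAsTable('default.withcolTest', mode='overwrite')
# After the round trip through the table, the same field is reported as nullable
print(spark.table('default.withcolTest').schema['newCol'].nullable)   # True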

Why does the nullable flag of the newCol column, added with withColumn, change from false to true when the DataFrame is persisted as a table?
