
I have a PySpark dataframe to which I want to add a new literal (lit) column, like this:

my_dataframe.select(col("col1"), lit("this is data").alias("col2"))

By default, when I write this to BigQuery, the lit column's type is string (good), but its mode is REQUIRED (bad). How can I write a lit column so that BigQuery treats it as NULLABLE? My workaround is below; I'm looking for a cleaner approach.

my_dataframe.select(col("col1"), when(lit(1) == 1, lit("this is data")).alias("col2"))
John
possible duplicate: https://stackoverflow.com/questions/46072411/can-i-change-the-nullability-of-a-column-in-my-spark-dataframe – YOLO Jan 13 '20 at 20:44

1 Answer


You could create a new dataframe with a different schema:

from pyspark.sql.types import StructType, StructField, StringType

my_dataframe = my_dataframe.select(col("col1"), when(lit(1) == 1, lit("this is data")).alias("col2"))

new_schema = StructType([
    StructField('col1', StringType(), False),
    StructField('col2', StringType(), True)
])

df2 = spark.createDataFrame(my_dataframe.rdd, new_schema)

Each StructField follows the syntax: StructField('<COLUMN-NAME>', <TYPE>, <NULLABLE? True or False>)

rmesteves