
Is there a way (without an extra cost, like the one proposed here) for a column to remain nullable=False after aggregating a non-nullable column? In theory, the aggregated column cannot contain a null value if the original column contains none (which is guaranteed here, since the column is non-nullable), so why does it not stay non-nullable? Can we force it?

Here is the demo, the complete use case is to join and remain non-nullable again:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()
schema = T.StructType(
    [
        T.StructField("left_id", T.StringType(), False),
        T.StructField("right_id", T.StringType(), False),
        T.StructField("count", T.LongType(), False),
    ]
)
coevents = spark.createDataFrame([
        ("a", "b", 2),
        ("a", "c", 3),
        ("b", "a", 6),
    ],
    schema,
)
left_sum = coevents.groupBy("left_id").agg(F.sum("count").alias("left_total_count"))
left_sum.printSchema()
left_sum.show()
coevents = coevents.join(left_sum, on="left_id", how="inner") # inner cannot introduce null values
coevents.printSchema()
coevents.show()

Output:

root
 |-- left_id: string (nullable = false)
 |-- left_total_count: long (nullable = true)
+-------+----------------+
|left_id|left_total_count|
+-------+----------------+
|      b|               6|
|      a|               5|
+-------+----------------+
root
 |-- left_id: string (nullable = false)
 |-- right_id: string (nullable = false)
 |-- count: long (nullable = false)
 |-- left_total_count: long (nullable = true)
+-------+--------+-----+----------------+
|left_id|right_id|count|left_total_count|
+-------+--------+-----+----------------+
|      b|       a|    6|               6|
|      a|       b|    2|               5|
|      a|       c|    3|               5|
+-------+--------+-----+----------------+
hayj
  • Have you tried using a coalesce as part of the sum? – Andrew Aug 05 '22 at 14:58
  • left_sum_2 = left_sum.select("left_id", coalesce("left_total_count", lit(0)).alias("notnull")) – Andrew Aug 05 '22 at 15:33
  • It works @Andrew, you can write an answer. But is there an extra cost? I'll try to benchmark it. – hayj Aug 10 '22 at 13:32
  • No extra cost, at least for this simple case. For struct/array fields, it is a bit more complicated to write and I didn't check if there is an extra cost: ```.withColumn("x", F.coalesce(F.col("x"), F.array(F.struct(F.lit(0).alias("x1"), F.lit(0).alias("x2")))))``` – hayj Aug 11 '22 at 07:56

1 Answer


As suggested by @Andrew, we can use F.coalesce so that the column remains nullable=False even in the joined dataframe:

left_sum = left_sum.withColumn("left_total_count", F.coalesce("left_total_count", F.lit(0)))

On our cluster, the use of F.coalesce doesn't add an extra cost.

hayj