Can anyone tell me why calling aggregate via expr works here, but calling it through pyspark.sql.functions does not? Spark version: 3.1.1.

I am trying to calculate the sum of an array column using aggregate.
Works:
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, IntegerType

sdf_login_7.withColumn("new_col",
                       f.array(f.col("a"), f.col("b")).cast(ArrayType(IntegerType())))\
    .withColumn("sum", f.expr('aggregate(new_col, 0L, (acc, x) -> acc + x)'))\
    .select(["a", "b", "new_col", "sum"]).show(3)
+---+---+--------+----+
| a| b| new_col| sum|
+---+---+--------+----+
| 10| 41|[10, 41]| 51|
| 11| 74|[11, 74]| 85|
| 11| 80|[11, 80]| 91|
+---+---+--------+----+
only showing top 3 rows
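
(sdf_login_7 is my own data; in case anyone wants to reproduce this without it, a minimal stand-in like the one below behaves the same way for me.)

import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql import SparkSession

# Minimal stand-in for sdf_login_7, just to make the snippet self-contained.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(10, 41), (11, 74), (11, 80)], ["a", "b"])

(sdf.withColumn("new_col", f.array(f.col("a"), f.col("b")).cast(ArrayType(IntegerType())))
    .withColumn("sum", f.expr('aggregate(new_col, 0L, (acc, x) -> acc + x)'))
    .select("a", "b", "new_col", "sum")
    .show())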
Doesn't work:
sdf_login_7.withColumn("new_col",
                       f.array(f.col("a"), f.col("b")).cast(ArrayType(IntegerType())))\
    .withColumn("sum", f.aggregate("new_col", f.lit(0), lambda acc, x: acc + x))\
    .select(["a", "b", "new_col", "sum"]).show(3)
Py4JError: org.apache.spark.sql.catalyst.expressions.UnresolvedNamedLambdaVariable.freshVarName does not exist in the JVM
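
(In case it is relevant: my understanding is that Py4J errors of the form "X does not exist in the JVM" usually mean the pyspark Python package is newer than the Spark runtime it is talking to on the JVM side. functions.aggregate was only added in 3.1.0, so a pre-3.1 JVM would lack UnresolvedNamedLambdaVariable.freshVarName. A quick check of both sides:)

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Version of the pyspark Python package (client side).
print("pyspark package:", pyspark.__version__)
# Version of the Spark runtime the session is connected to (JVM side).
print("Spark JVM:", spark.version)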