
Can anyone tell me why the `aggregate` call works via `expr` here but not through `pyspark.sql.functions`? Spark version: 3.1.1.

I am trying to calculate the sum of an array column using `aggregate`.

Works:

import pyspark.sql.functions as f
from pyspark.sql.types import *

sdf_login_7.withColumn("new_col", f.array(f.col("a"), f.col("b"))
                       .cast(ArrayType(IntegerType())))\
           .withColumn("sum", f.expr('aggregate(new_col, 0L, (acc, x) -> acc + x)'))\
           .select(["a", "b", "new_col", "sum"]).show(3)

+---+---+--------+----+
|  a|  b| new_col| sum|
+---+---+--------+----+
| 10| 41|[10, 41]|  51|
| 11| 74|[11, 74]|  85|
| 11| 80|[11, 80]|  91|
+---+---+--------+----+
only showing top 3 rows

Doesn't work:

sdf_login_7.withColumn("new_col", f.array(f.col("a"), f.col("b"))
                       .cast(ArrayType(IntegerType())))\
           .withColumn("sum", f.aggregate("new_col", f.lit(0), lambda acc, x: acc + x))\
           .select(["a", "b", "new_col", "sum"]).show(3)

Py4JError: org.apache.spark.sql.catalyst.expressions.UnresolvedNamedLambdaVariable.freshVarName does not exist in the JVM

s510
  • That's strange. I'm not sure. Maybe you could try `f.lit(0).cast('int')` instead of just `f.lit(0)`. I can't check it currently. – ZygD Sep 15 '22 at 18:48
  • Still the same :/ – s510 Sep 15 '22 at 19:36
  • I've tested the scripts on Spark 3.1.1 with this dataframe: `sdf_login_7 = spark.createDataFrame([(10, 41), (11, 74), (11, 80)], ['a', 'b'])`. Both scripts work fine. The problem should be somewhere else. – ZygD Sep 16 '22 at 03:39
  • The name of the new column in the failing script should be "sum" instead of "mean", but I doubt this is the cause. – ZygD Sep 16 '22 at 03:41
  • Thanks for pointing that out and trying. Corrected it now. Yes, the problem still exists. I found this while working on this question: https://stackoverflow.com/a/73732617/12775531 – s510 Sep 16 '22 at 07:05
  • Both scripts work for me using Spark 3.3. – walking Sep 17 '22 at 09:22
  • You don't show how you create the input dataframe, either in this question or in the referenced answer. I have to vote to close as not reproducible. – ZygD Sep 19 '22 at 05:24

0 Answers