
Can anyone tell me why the `aggregate` call works via `expr` here but not through `pyspark.sql.functions`? Spark version: 3.1.1.

I am trying to calculate the sum of an array column using `aggregate`.

Works:

import pyspark.sql.functions as f
from pyspark.sql.types import *

sdf_login_7.withColumn("new_col", f.array(f.col("a"), f.col("b"))
                       .cast(ArrayType(IntegerType())))\
           .withColumn("sum", f.expr('aggregate(new_col, 0L, (acc, x) -> acc + x)'))\
           .select(["a", "b", "new_col", "sum"]).show(3)

+---+---+--------+----+
|  a|  b| new_col| sum|
+---+---+--------+----+
| 10| 41|[10, 41]|  51|
| 11| 74|[11, 74]|  85|
| 11| 80|[11, 80]|  91|
+---+---+--------+----+
only showing top 3 rows

Doesn't work:

sdf_login_7.withColumn("new_col", f.array(f.col("a"), f.col("b"))
                       .cast(ArrayType(IntegerType())))\
           .withColumn("sum", f.aggregate("new_col", f.lit(0), lambda acc, x: acc + x))\
           .select(["a", "b", "new_col", "sum"]).show(3)

Py4JError: org.apache.spark.sql.catalyst.expressions.UnresolvedNamedLambdaVariable.freshVarName does not exist in the JVM

s510
  • That's strange. I'm not sure. Maybe you could try `f.lit(0).cast('int')` instead of just `f.lit(0)`. I can't check it currently. – ZygD Sep 15 '22 at 18:48
  • Still the same :/ – s510 Sep 15 '22 at 19:36
  • I've tested the scripts on Spark 3.1.1 with this dataframe: `sdf_login_7 = spark.createDataFrame([(10, 41), (11, 74), (11, 80)], ['a', 'b'])`. Both scripts work fine. The problem should be somewhere else. – ZygD Sep 16 '22 at 03:39
  • The name of the new column in the failing script should be "sum" instead of "mean", but I doubt this is the cause. – ZygD Sep 16 '22 at 03:41
  • Thanks for pointing that out and trying. Corrected it now. Yes, the problem still exists. I found this while working on this question: https://stackoverflow.com/a/73732617/12775531 – s510 Sep 16 '22 at 07:05
  • Both scripts work for me using Spark 3.3. – walking Sep 17 '22 at 09:22
  • You don't show how you create the input dataframe, either in this question or in the referenced answer. I have to vote to close as not reproducible. – ZygD Sep 19 '22 at 05:24

0 Answers