
I wrote a custom Aggregator (an extension of org.apache.spark.sql.expressions.Aggregator) and Spark invokes it correctly as an aggregating function in a groupBy statement:

sparkSession
    .createDataFrame(...)
    .groupBy(col("id"))
    .agg(
        new MyCustomAggregator().toColumn().name("aggregation_result"))
    .show();

I would like to use it within a window function though, because ordering matters to me. I've tried invoking it like this:

sparkSession
    .createDataFrame(...)
    .withColumn("aggregation_result", new MyCustomAggregator().toColumn().over(Window
        .partitionBy(col("id"))
        .orderBy(col("order"))))
    .show();

This is the error I get:

org.apache.spark.sql.AnalysisException: cannot resolve '(PARTITION BY `id` ORDER BY `order` ASC NULLS FIRST unspecifiedframe$())' due to data type mismatch: Cannot use an UnspecifiedFrame. This should have been converted during analysis. Please file a bug report.

Is it at all possible to use custom Aggregators as window functions in Spark 3.0.1? If so, what am I missing here?

  • Seems like you need to use UDAF instead of Aggregator: https://stackoverflow.com/questions/50261663/spark-cannot-use-an-unspecifiedframe-this-should-have-been-converted-during-a – mck Dec 01 '20 at 13:45
  • @mck this question was asked two years ago, when Spark 3.0, which introduced significant changes to user defined aggregation, was not around. – igor Dec 01 '20 at 13:56
  • yeah, but you seem to be getting the same error. can you use UDAF instead? – mck Dec 01 '20 at 13:57
  • That would be OK as long as it works, however I was not so far able to make UDAF work either. A code snippet with an example would be appreciated. – igor Dec 01 '20 at 14:08
  • https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html – mck Dec 01 '20 at 14:09
  • @mck this approach uses the deprecated UserDefinedAggregateFunction, which I would rather avoid – igor Dec 01 '20 at 14:21
  • I believe you attempted to write a Custom Aggregator. UDAF's hard stuff. – thebluephantom Dec 01 '20 at 18:42

1 Answer


Yes, Spark 3 does support custom aggregators as window functions — you just need to register the Aggregator as a UDAF first via functions.udaf.

Here is the Java code:

UserDefinedFunction myCustomAggregation = functions.udaf(new MyCustomAggregator(), Encoders.bean(AggregationInput.class));

sparkSession
    .createDataFrame(...)
    .withColumn("aggregation_result", myCustomAggregation.apply(col("aggregation_input1"), col("aggregation_input2")).over(Window
        .partitionBy(col("id"))
        .orderBy(col("order"))))
    .show();

AggregationInput here is a simple DTO with the row elements needed for your aggregation function.
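As a minimal sketch of such a DTO (the field names here are illustrative, assumed to match the two columns passed to the UDF above):

```java
import java.io.Serializable;

// Illustrative bean for Encoders.bean(AggregationInput.class).
// Field names are assumptions matching col("aggregation_input1") and
// col("aggregation_input2") in the snippet above.
// Encoders.bean requires a public no-arg constructor and a
// getter/setter pair for every field.
public class AggregationInput implements Serializable {
    private long aggregationInput1;
    private long aggregationInput2;

    public AggregationInput() {}

    public long getAggregationInput1() { return aggregationInput1; }
    public void setAggregationInput1(long v) { this.aggregationInput1 = v; }

    public long getAggregationInput2() { return aggregationInput2; }
    public void setAggregationInput2(long v) { this.aggregationInput2 = v; }
}
```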

So whether you aggregate with groupBy or as a window function, you still want to use org.apache.spark.sql.expressions.Aggregator.
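For reference, here is a minimal sketch of such an Aggregator — a plain sum over a single long column, not the original MyCustomAggregator (the class name and logic are illustrative):

```java
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.expressions.Aggregator;

// Illustrative Aggregator that sums a long column. Register it with
// functions.udaf(new SumAggregator(), Encoders.LONG()) and the resulting
// UserDefinedFunction works both in agg(...) and with .over(window).
public class SumAggregator extends Aggregator<Long, Long, Long> {
    @Override public Long zero() { return 0L; }                           // initial buffer
    @Override public Long reduce(Long buf, Long in) { return buf + in; }  // fold one row in
    @Override public Long merge(Long b1, Long b2) { return b1 + b2; }     // combine partials
    @Override public Long finish(Long buf) { return buf; }                // final result
    @Override public Encoder<Long> bufferEncoder() { return Encoders.LONG(); }
    @Override public Encoder<Long> outputEncoder() { return Encoders.LONG(); }
}
```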
