
I wrote a custom Aggregator (an extension of org.apache.spark.sql.expressions.Aggregator) and Spark invokes it correctly as an aggregating function in a groupBy statement:

sparkSession
    .createDataFrame(...)
    .groupBy(col("id"))
    .agg(
        new MyCustomAggregator().toColumn().name("aggregation_result"))
    .show();

I would like to use it within a window function though, because ordering matters to me. I've tried invoking it like this:

sparkSession
    .createDataFrame(...)
    .withColumn("aggregation_result", new MyCustomAggregator().toColumn().over(Window
        .partitionBy(col("id"))
        .orderBy(col("order"))))
    .show();

This is the error I get:

org.apache.spark.sql.AnalysisException: cannot resolve '(PARTITION BY `id` ORDER BY `order` ASC NULLS FIRST unspecifiedframe$())' due to data type mismatch: Cannot use an UnspecifiedFrame. This should have been converted during analysis. Please file a bug report.

Is it at all possible to use custom Aggregators as window functions in Spark 3.0.1? If so, what am I missing here?

  • Seems like you need to use UDAF instead of Aggregator: https://stackoverflow.com/questions/50261663/spark-cannot-use-an-unspecifiedframe-this-should-have-been-converted-during-a – mck Dec 01 '20 at 13:45
  • @mck this question was asked two years ago, when Spark 3.0, which introduced significant changes to user defined aggregation, was not around. – igor Dec 01 '20 at 13:56
  • yeah, but you seem to be getting the same error. can you use UDAF instead? – mck Dec 01 '20 at 13:57
  • That would be OK as long as it works, however I was not so far able to make UDAF work either. A code snippet with an example would be appreciated. – igor Dec 01 '20 at 14:08
  • https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html – mck Dec 01 '20 at 14:09
  • @mck this approach uses the deprecated UserDefinedAggregateFunction, which I would rather avoid – igor Dec 01 '20 at 14:21
  • I believe you attempted to write a Custom Aggregator. UDAF's hard stuff. – thebluephantom Dec 01 '20 at 18:42

1 Answer


Yes, Spark 3 does support custom aggregators as window functions — you just need to register the Aggregator as a UDAF first via functions.udaf.

Here is the Java code:

UserDefinedFunction myCustomAggregation = functions.udaf(new MyCustomAggregator(), Encoders.bean(AggregationInput.class));

sparkSession
    .createDataFrame(...)
    .withColumn("aggregation_result", myCustomAggregation.apply(col("aggregation_input1"), col("aggregation_input2")).over(Window
        .partitionBy(col("id"))
        .orderBy(col("order"))))
    .show();

AggregationInput here is a simple DTO with the row elements needed for your aggregation function.
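As a minimal sketch of such a DTO (the field names here are illustrative, assumed to match the two columns passed to the UDF above):

```java
import java.io.Serializable;

// Illustrative bean for Encoders.bean(AggregationInput.class).
// Field names are assumptions matching col("aggregation_input1") and
// col("aggregation_input2") in the snippet above.
// Encoders.bean requires a public no-arg constructor and a
// getter/setter pair for every field.
public class AggregationInput implements Serializable {
    private long aggregationInput1;
    private long aggregationInput2;

    public AggregationInput() {}

    public long getAggregationInput1() { return aggregationInput1; }
    public void setAggregationInput1(long v) { this.aggregationInput1 = v; }

    public long getAggregationInput2() { return aggregationInput2; }
    public void setAggregationInput2(long v) { this.aggregationInput2 = v; }
}
```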

So whether you aggregate with groupBy or as a window function, you still want to use org.apache.spark.sql.expressions.Aggregator.
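For reference, here is a minimal sketch of such an Aggregator — a plain sum over a single long column, not the original MyCustomAggregator (the class name and logic are illustrative):

```java
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.expressions.Aggregator;

// Illustrative Aggregator that sums a long column. Register it with
// functions.udaf(new SumAggregator(), Encoders.LONG()) and the resulting
// UserDefinedFunction works both in agg(...) and with .over(window).
public class SumAggregator extends Aggregator<Long, Long, Long> {
    @Override public Long zero() { return 0L; }                           // initial buffer
    @Override public Long reduce(Long buf, Long in) { return buf + in; }  // fold one row in
    @Override public Long merge(Long b1, Long b2) { return b1 + b2; }     // combine partials
    @Override public Long finish(Long buf) { return buf; }                // final result
    @Override public Encoder<Long> bufferEncoder() { return Encoders.LONG(); }
    @Override public Encoder<Long> outputEncoder() { return Encoders.LONG(); }
}
```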
