
In this answer, most of the responses recommend the groupByKey + reduceGroups strategy. However, I find no explanation of why Spark removed the reduceByKey API. There is a comment saying that Spark's Catalyst optimizer can push down some computation, which may explain why. However, in both the author's tests and my own, the Dataset groupByKey + reduceGroups strategy is much slower than reduceByKey.

So why was reduceByKey removed, and how can I find an alternative to it?

calvin
  • Can you share those tests? And the input data? I would expect them to be implemented in similar ways - Anyways, I doubt this is the correct place to ask such questions, the only ones that can answer that objectively would be the spark contributors. Maybe their mailing channel or bug tracker would be a better place. – Luis Miguel Mejía Suárez Aug 05 '19 at 15:27

1 Answer


The comments in that answer suggest that, since Spark 2.1.0, groupByKey followed by reduceGroups on a Dataset behaves in the same manner as a reduceByKey operation on an RDD.

https://issues.apache.org/jira/browse/SPARK-16391
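That claimed equivalence can be illustrated without a Spark cluster on plain Scala collections. This is a sketch only: the sum reduction is an illustrative stand-in for any associative function, and the value names are not from the linked thread. Both strategies apply the same binary function to every value sharing a key, so they produce the same result.

```scala
// Plain-Scala sketch of the equivalence (no Spark needed; the sum
// reduction stands in for an arbitrary associative function).
val pairs = Seq(("a", 1), ("a", 2), ("b", 3))

// reduceByKey-style: combine the values of each key with a binary function.
val viaReduceByKey = pairs
  .groupBy(_._1)
  .map { case (k, vs) => (k, vs.map(_._2).reduce(_ + _)) }

// groupByKey + reduceGroups-style: group whole pairs, then reduce each group.
val viaReduceGroups = pairs
  .groupBy(_._1)
  .map { case (k, vs) => (k, vs.reduce((a, b) => (k, a._2 + b._2))._2) }

println(viaReduceByKey == viaReduceGroups) // the two maps are identical
```

The performance difference the question observes is about *how* Spark executes these plans (reduceByKey combines map-side before shuffling), not about the result they compute.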

Spark hasn't removed the reduceByKey API. To use reduceByKey, your data has to be a pair RDD (an RDD[(K, V)]). For example, if you have a Dataset and want to use reduceByKey, you would have to do something like:

import spark.implicits._  // needed for the tuple Encoder and toDF()

df
  .map(row => (row.key, row.value))                   // project to (key, value) pairs
  .rdd                                                // drop down to the RDD API
  .reduceByKey((a, b) => someReductionFunction(a, b)) // reduceByKey needs a pair RDD
  .values                                             // keep only the reduced values
  .toDF()                                             // back to a DataFrame

Note that the second line turns your Dataset rows into tuples with two "columns" (a key and a value), since reduceByKey expects a pair RDD. This method is also not performant if you already have a Dataset, because it converts the Dataset into an RDD and then back into a DataFrame or Dataset if you want to continue with Dataset operations.
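The round-trip described above can be mirrored step by step on plain Scala collections. This is a sketch only: the Row case class and the sum reduction are hypothetical stand-ins for your actual schema and reduction function. Note how the final step drops the keys, just as .values does on the pair RDD.

```scala
// Plain-Scala mirror of the pipeline above (no Spark; Row and the sum
// reduction are hypothetical stand-ins for your schema and function).
case class Row(key: String, value: Int)

val rows = Seq(Row("a", 1), Row("a", 2), Row("b", 5))

val reducedValues = rows
  .map(r => (r.key, r.value))        // project rows to (key, value) pairs
  .groupBy(_._1)                     // the pair-RDD grouping step
  .map { case (_, vs) => vs.map(_._2).reduce(_ + _) } // reduceByKey analogue
  .toSeq                             // .values analogue: the keys are dropped

println(reducedValues.sorted)        // reduced values 3 and 5, no keys left
```

If the reduction can be expressed as a built-in SQL aggregate, staying in the untyped groupBy + agg API avoids this RDD round-trip entirely and lets Catalyst optimize the plan.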

penguins