In this question, most of the answers introduce the groupByKey + reduceGroups strategy. However, I find no comment on why Spark removed the reduceByKey API. One comment says that Spark's Catalyst optimizer can push down some computation, which may explain why. However, according to both the author's test and my own, Dataset's groupByKey + reduceGroups strategy is much slower than reduceByKey.
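For reference, here is a minimal sketch of the two approaches being compared. It assumes a local SparkSession and made-up sample data; the object name ReduceByKeyComparison and the pairs are hypothetical and are not the code from the tests mentioned above.

```scala
import org.apache.spark.sql.SparkSession

object ReduceByKeyComparison {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to illustrate the two APIs.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("reduceByKey-vs-reduceGroups")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: (key, value) pairs.
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

    // RDD API: reduceByKey merges values per key on each partition
    // before the shuffle.
    val viaRdd = spark.sparkContext
      .parallelize(pairs)
      .reduceByKey(_ + _)
      .collect()
      .toMap

    // Dataset API: group by a key function, then reduce whole records
    // within each group. The result is a Dataset of (key, reducedRecord).
    val viaDataset = pairs.toDS()
      .groupByKey(_._1)
      .reduceGroups((x: (String, Int), y: (String, Int)) => (x._1, x._2 + y._2))
      .map { case (_, (k, v)) => (k, v) }
      .collect()
      .toMap

    // Both approaches should produce the same per-key sums.
    assert(viaRdd == viaDataset)

    spark.stop()
  }
}
```

The sketch only shows that the two calls compute the same result; it says nothing about their relative speed, which is the part I am asking about.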
So why was reduceByKey removed, and what can I use as an alternative?