
From:

https://github.com/spotify/scio/wiki/Scio-data-guideline

"Prefer combine/aggregate/reduce transforms over groupByKey. Keep in mind that a reduce operation must be associative and commutative."

Why in particular would one prefer an aggregate over a groupByKey?

Andrew Cassidy

1 Answer


Combine, aggregation, and reduce transforms are preferred over groupByKey because they are more memory-efficient during pipeline execution. This comes down to how the primitive GroupByKey and Combine transforms are implemented in Apache Beam; the answer isn't specific to Scio.

GroupByKey collects all of the values for a key, per window, into a single iterable, which can mean holding them in memory at once and risking OutOfMemoryErrors for large key groups. Scio's groupByKey uses Beam's primitive GroupByKey transform.

Aggregations remove the need to hold all values in memory because values are combined/reduced continually as the transform executes. They are combined in a non-deterministic order, which is why combine/reduce operations must be associative and commutative. Scio's aggregateByKey is implemented with Beam's primitive Combine transform.
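To make the difference concrete, here is a minimal Scio-style sketch, assuming a keyed SCollection of (userId, amount) pairs; the name userTotals is invented for the example, and exact method signatures may differ between Scio versions:

    import com.spotify.scio.values.SCollection

    // Hypothetical input: (userId, purchaseAmount) pairs.
    def userTotals(purchases: SCollection[(String, Double)]): SCollection[(String, Double)] = {
      // groupByKey materializes every value for a key (per window) before summing:
      // purchases.groupByKey.mapValues(_.sum)

      // aggregateByKey folds values into an accumulator as the transform runs
      // (Beam's Combine primitive), so the full value list never needs to be
      // held at once. Both functions must be associative and commutative.
      purchases.aggregateByKey(0.0)(_ + _, _ + _)
    }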

References:
1. Scio groupByKey
2. Scio aggregateByKey
3. Apache Beam GroupByKey
4. Apache Beam Combine
5. Google Cloud Dataflow Combine

Andrew Nguonly
  • I'd also recommend checking out Daniel's answer at https://stackoverflow.com/questions/6928374/example-of-the-scala-aggregate-function – Andrew Cassidy May 13 '18 at 21:37
  • I guess what I'm still confused about is how do you use aggregateByKey without first creating a PairSCollectionFunctions which requires groupByKey? DOE... I just got it. You assign the key to the original SCollection. – Andrew Cassidy May 13 '18 at 21:43
  • I don't know about Scio in particular, but Beam in general can handle GroupByKey where not all the key-value pairs fit into memory. Aggregation is still preferred as it allows one to offload some of the reduction to the mappers before grouping (both distributing CPU load and reducing the amount of data shuffled). – robertwb Feb 21 '19 at 08:37
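Following up on the last two comments, a hedged sketch of keying the original SCollection directly (no groupByKey needed to reach the pair functions) so the aggregation can pre-combine on the mappers before the shuffle; Purchase and totalPerUser are made-up names for illustration:

    import com.spotify.scio.values.SCollection

    final case class Purchase(userId: String, amount: Double)

    def totalPerUser(purchases: SCollection[Purchase]): SCollection[(String, Double)] =
      purchases
        .keyBy(_.userId)              // assign the key => SCollection[(String, Purchase)]
        .aggregateByKey(0.0)(         // Beam Combine: partial sums are computed on the
          (acc, p) => acc + p.amount, // mappers before the shuffle, shrinking the data moved
          _ + _                       // merge partial accumulators
        )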