
I have a data set extracted from HBase, which is a long-form version of a wide table, i.e. it has rowKey, columnQualifier and value columns. To get a form of pivot, I need to group by rowKey, which is a string UUID, into a collection and build an object out of that collection. The problem is that the only group-by I manage to perform is counting the number of elements per group; all other group-bys fail because the containers are killed for exceeding the YARN memory limits. I have experimented a lot with the memory sizes, including the overhead, and with repartitioning with and without sorting, etc. I even went up to a very high number of partitions, i.e. about 10,000, but the job dies just the same. I have tried both the DataFrame `groupBy` with `collect_list` and the Dataset `groupByKey` with `mapGroups`; a rough sketch of the former is shown below.
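For reference, a minimal sketch of the `collect_list` variant (the column names and paths here are placeholders, adjust to the actual schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

object CollectListPivot {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("collect-list-pivot").getOrCreate()

    // Long-form table extracted from HBase; the column names are assumptions.
    val longDf = spark.read.parquet("/path/to/long_form_parquet")

    // One row per rowKey, with all (columnQualifier, value) pairs collected
    // into an array -- this is the aggregation that blows the container memory.
    val pivoted = longDf
      .groupBy("rowKey")
      .agg(collect_list(struct("columnQualifier", "value")).as("cells"))

    pivoted.write.parquet("/path/to/pivoted_parquet")
  }
}
```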

The code works on a small data set but not on the larger one. The data set is about 500 GB in Parquet files. The data is not skewed, as the largest group in the group-by has only 50 elements. Thus, by all means known to me, the partitions should easily fit in memory, since the aggregated data per rowKey is not really large. The keys and values are mostly strings and they are not long.

I am using Spark 2.0.2; the above computations were all done in Scala.

1 Answer


You're probably running into the dreaded `groupByKey` shuffle. Please read this Databricks article on avoiding groupByKey, which details the underlying differences between `groupByKey` and `reduceByKey`.

If you don't want to read the article, the short story is this: although `groupByKey` and `reduceByKey` produce the same results, `groupByKey` shuffles ALL of the data, while `reduceByKey` tries to minimize the data shuffle by reducing on each partition first. It is a bit like MapReduce combiners, if you're familiar with that concept. A `reduceByKey`-style sketch is shown below.
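As an illustration only (not your exact job), here is a minimal sketch of a `reduceByKey`-based pivot that collapses each group into a small `Map` before the shuffle; the paths and column names are assumptions:

```scala
import org.apache.spark.sql.SparkSession

object ReduceByKeyPivot {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reduce-by-key-pivot").getOrCreate()

    // Long-form (rowKey, columnQualifier, value) records; names are assumptions.
    val cells = spark.read.parquet("/path/to/long_form_parquet").rdd
      .map(r => (r.getAs[String]("rowKey"),
                 Map(r.getAs[String]("columnQualifier") -> r.getAs[String]("value"))))

    // reduceByKey merges the per-key maps on the map side before shuffling,
    // so only one small Map per rowKey per partition crosses the network.
    val pivoted = cells.reduceByKey(_ ++ _)

    pivoted.take(5).foreach(println)
  }
}
```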

JamCon
  • Thanks for the suggestion, but I know the article and how it works. In this case the problem is quite bounded, so I thought this `groupBy` shouldn't be that expensive. I have not tried `reduceByKey`; I will give it a go. Nonetheless, for my problem I have to either use `Map[String,Any]` or use reflection to pack the single values into a large object and create a notion of a sum. Interestingly, the DataFrame function `collect_list`, which should be optimal and presumably avoids `groupByKey`, throws the same error. – Jakub Nowacki Feb 20 '17 at 20:17
  • I have rebuilt the job using the `reduceByKey` approach and it performs better, despite the fact that I'm concatenating maps. It can also be done on a `Dataset` using `groupByKey` with `reduceGroups` or `mapGroups`, but that is not as optimal: see [this post](http://stackoverflow.com/questions/38383207/rolling-your-own-reducebyke-in-spark-dataset). A sketch of the `Dataset` variant follows below. – Jakub Nowacki Mar 21 '17 at 14:20
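A minimal sketch of the `Dataset` variant mentioned above, assuming a hypothetical `Cell` case class and placeholder paths; the comment reports it as less efficient than the RDD `reduceByKey` version:

```scala
import org.apache.spark.sql.SparkSession

object DatasetReduceGroupsPivot {
  // Hypothetical case class matching the assumed long-form schema.
  case class Cell(rowKey: String, columnQualifier: String, value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-reduce-groups-pivot").getOrCreate()
    import spark.implicits._

    val cells = spark.read.parquet("/path/to/long_form_parquet").as[Cell]

    // Reduce each group to a single (rowKey, Seq of (qualifier, value)) pair;
    // the collected pairs can then be turned into an object or map per rowKey.
    val pivoted = cells
      .map(c => (c.rowKey, Seq(c.columnQualifier -> c.value)))
      .groupByKey(_._1)
      .reduceGroups((a, b) => (a._1, a._2 ++ b._2))
      .map { case (key, (_, pairs)) => (key, pairs) }

    pivoted.show(5, truncate = false)
  }
}
```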