
I have a data set extracted from HBase, which is a long-form version of a wide table, i.e. it has rowKey, columnQualifier and value columns. To get a form of pivot, I need to group by rowKey, which is a string UUID, into a collection and build an object out of that collection. The problem is that the only group-by I manage to perform is counting the number of elements per group; all other group-bys fail because the containers are killed for exceeding the YARN memory limits. I have experimented a lot with the memory sizes, including the overhead, and with repartitioning with and without sorting, etc. I even went up to a very high number of partitions, i.e. about 10,000, but the job dies just the same. I have tried both the DataFrame `groupBy` with `collect_list` and the Dataset `groupByKey` with `mapGroups`; a rough sketch of the former is shown below.
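For reference, a minimal sketch of the `collect_list` variant (the column names and paths here are placeholders, adjust to the actual schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

object CollectListPivot {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("collect-list-pivot").getOrCreate()

    // Long-form table extracted from HBase; the column names are assumptions.
    val longDf = spark.read.parquet("/path/to/long_form_parquet")

    // One row per rowKey, with all (columnQualifier, value) pairs collected
    // into an array -- this is the aggregation that blows the container memory.
    val pivoted = longDf
      .groupBy("rowKey")
      .agg(collect_list(struct("columnQualifier", "value")).as("cells"))

    pivoted.write.parquet("/path/to/pivoted_parquet")
  }
}
```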

The code works on a small data set but not on the larger one. The data set is about 500 GB in Parquet files. The data is not skewed, as the largest group in the group-by has only 50 elements. Thus, by all means known to me, the partitions should easily fit in memory, since the aggregated data per rowKey is not really large. The keys and values are mostly strings and they are not long.

I am using Spark 2.0.2; the above computations were all done in Scala.

1 Answer


You're probably running into the dreaded `groupByKey` shuffle. Please read this Databricks article on avoiding groupByKey, which details the underlying differences between `groupByKey` and `reduceByKey`.

If you don't want to read the article, the short story is this: although `groupByKey` and `reduceByKey` produce the same results, `groupByKey` shuffles ALL of the data, while `reduceByKey` tries to minimize the data shuffle by reducing on each partition first. It is a bit like MapReduce combiners, if you're familiar with that concept. A `reduceByKey`-style sketch is shown below.
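As an illustration only (not your exact job), here is a minimal sketch of a `reduceByKey`-based pivot that collapses each group into a small `Map` before the shuffle; the paths and column names are assumptions:

```scala
import org.apache.spark.sql.SparkSession

object ReduceByKeyPivot {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reduce-by-key-pivot").getOrCreate()

    // Long-form (rowKey, columnQualifier, value) records; names are assumptions.
    val cells = spark.read.parquet("/path/to/long_form_parquet").rdd
      .map(r => (r.getAs[String]("rowKey"),
                 Map(r.getAs[String]("columnQualifier") -> r.getAs[String]("value"))))

    // reduceByKey merges the per-key maps on the map side before shuffling,
    // so only one small Map per rowKey per partition crosses the network.
    val pivoted = cells.reduceByKey(_ ++ _)

    pivoted.take(5).foreach(println)
  }
}
```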

JamCon
  • Thanks for the suggestion, but I know the article and how it works. In this case the problem is quite bounded, so I thought this `groupBy` shouldn't be that expensive. I have not tried `reduceByKey`; I will give it a go. Nonetheless, for my problem I have to either use `Map[String,Any]` or use reflection to pack the single values into a large object and create a notion of a sum. Interestingly, the DataFrame function `collect_list`, which should be optimal and presumably avoids `groupByKey`, throws the same error. – Jakub Nowacki Feb 20 '17 at 20:17
  • I have rebuilt the job using the `reduceByKey` approach and it performs better, despite the fact that I'm concatenating maps. It can also be done on a `Dataset` using `groupByKey` with `reduceGroups` or `mapGroups`, but that is not as optimal: see [this post](http://stackoverflow.com/questions/38383207/rolling-your-own-reducebyke-in-spark-dataset). A sketch of the `Dataset` variant follows below. – Jakub Nowacki Mar 21 '17 at 14:20
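A minimal sketch of the `Dataset` variant mentioned above, assuming a hypothetical `Cell` case class and placeholder paths; the comment reports it as less efficient than the RDD `reduceByKey` version:

```scala
import org.apache.spark.sql.SparkSession

object DatasetReduceGroupsPivot {
  // Hypothetical case class matching the assumed long-form schema.
  case class Cell(rowKey: String, columnQualifier: String, value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-reduce-groups-pivot").getOrCreate()
    import spark.implicits._

    val cells = spark.read.parquet("/path/to/long_form_parquet").as[Cell]

    // Reduce each group to a single (rowKey, Seq of (qualifier, value)) pair;
    // the collected pairs can then be turned into an object or map per rowKey.
    val pivoted = cells
      .map(c => (c.rowKey, Seq(c.columnQualifier -> c.value)))
      .groupByKey(_._1)
      .reduceGroups((a, b) => (a._1, a._2 ++ b._2))
      .map { case (key, (_, pairs)) => (key, pairs) }

    pivoted.show(5, truncate = false)
  }
}
```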