I have a data set extracted from HBase, which is a long-form representation of a wide table, i.e. it has `rowKey`, `columnQualifier`, and `value` columns. To get a form of pivot, I need to group by `rowKey` (a string UUID), collect each group, and build an object out of the collection. The problem is that the only group-by I manage to perform is counting the number of elements per group; all other group-bys fail because the containers are killed for exceeding YARN memory limits. I have experimented a lot with the memory settings, including the overhead, and with partitioning with and without sorting. I even went up to a very high number of partitions, about 10,000, but the job dies all the same. I tried both the DataFrame `groupBy` with `collect_list` and the Dataset `groupByKey` with `mapGroups`; a sketch of both variants is below.
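
For reference, this is roughly what both attempts look like (the input path, the `LongRow`/`Cell`/`Record` case classes, and the column handling are simplified stand-ins for my actual code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

// Simplified stand-ins for my actual schema and target object.
case class LongRow(rowKey: String, columnQualifier: String, value: String)
case class Cell(columnQualifier: String, value: String)
case class Record(rowKey: String, cells: Seq[Cell])

object PivotJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-pivot").getOrCreate()
    import spark.implicits._

    // Long-form input: one row per (rowKey, columnQualifier, value) triple.
    val long = spark.read.parquet("/path/to/long-form-parquet")

    // Variant 1: DataFrame groupBy + collect_list of (qualifier, value) structs.
    val viaDataFrame = long
      .groupBy($"rowKey")
      .agg(collect_list(struct($"columnQualifier", $"value")).as("cells"))

    // Variant 2: Dataset groupByKey + mapGroups building the object directly.
    val viaDataset = long
      .as[LongRow]
      .groupByKey(_.rowKey)
      .mapGroups { (rowKey, rows) =>
        Record(rowKey, rows.map(r => Cell(r.columnQualifier, r.value)).toVector)
      }

    // Both variants die the same way once an action forces the shuffle,
    // e.g. viaDataFrame.write.parquet(...) or viaDataset.write.parquet(...).
  }
}
```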
The code works on a small data set but not on the larger one, which is about 500 GB in Parquet files. The data is not skewed: the largest group produced by the group-by has only 50 elements. So, by every means known to me, the partitions should easily fit in memory, since the aggregated data per `rowKey` is not large at all. The keys and values are mostly strings, and they are not long.
I am using Spark 2.0.2; the above computations were all done in Scala.
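
For completeness, this is the shape of the submission I have been tuning; the values below are examples of settings I tried, not one fixed configuration (property names are the Spark 2.0.x ones):

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 16g \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.sql.shuffle.partitions=10000 \
  --class PivotJob pivot-job.jar
```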