I think anyone who has used Spark has run into OOM errors, and usually the source of the problem is easy to find. However, I am a bit perplexed by this one. Currently, I am trying to write a dataframe partitioned by two different columns using the partitionBy
function. It looks something like below (made-up names):
import org.apache.spark.sql.SaveMode

df.write.partitionBy("account", "markers")
  .mode(SaveMode.Overwrite)
  .parquet(s"$location$org/$corrId/")
This particular dataframe holds around 30 GB of data, 2000 accounts, and 30 markers, with the accounts and markers close to evenly distributed. I have tried a cluster of 5 core nodes and 1 master/driver node of Amazon's r4.8xlarge (220+ GB of memory) with EMR's default maximizeResourceAllocation setting (which gives executors 2x the cores and around 165 GB of memory). I have also explicitly set the number of cores and executors to different values, but hit the same issues. When looking at Ganglia, I don't see any excessive memory consumption.
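For reference, the explicit settings I tried looked roughly like the sketch below; the exact values varied between runs, so treat the numbers and the app name as illustrative only:

// Illustrative configuration; actual values were varied across runs.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioned-write")              // hypothetical app name
  .config("spark.executor.instances", "10")  // example value
  .config("spark.executor.cores", "16")      // example value
  .config("spark.executor.memory", "100g")   // example value
  .getOrCreate()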
So, it seems very likely that the root cause is the 2 GB ByteArrayBuffer limit that can be hit on shuffles. I then tried repartitioning the dataframe with various partition counts, such as 100, 500, 1000, 3000, 5000, and 10000, with no luck. The job occasionally logs a heap space error, but most of the time gives a node lost error. When looking at the individual node logs, they just seem to fail suddenly with no indication of the problem (which isn't surprising with some OOM exceptions).
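The repartition attempts were along these lines (a sketch; 1000 stands in for each of the counts listed above):

// Sketch of the repartition attempts; the partition count was varied
// across runs (100, 500, 1000, 3000, 5000, 10000).
df.repartition(1000)
  .write.partitionBy("account", "markers")
  .mode(SaveMode.Overwrite)
  .parquet(s"$location$org/$corrId/")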
For dataframe writes, is there a trick with partitionBy to get past the heap space and node lost errors?