
I think anyone who has used Spark has run across OOM errors, and usually the source of the problem can be found easily. However, I am a bit perplexed by this one. Currently, I am trying to write a dataframe partitioned by two different columns, using the partitionBy function. It looks something like this (with made-up names):

df.write.partitionBy("account", "markers")
  .mode(SaveMode.Overwrite)
  .parquet(s"$location$org/$corrId/")

This particular dataframe has around 30 GB of data, 2000 accounts and 30 markers, with the accounts and markers close to evenly distributed. I have tried using 5 core nodes and 1 master node (the driver) of Amazon's r4.8xlarge type (220+ GB of memory) with EMR's default maximize resource allocation setting (which gives executors 2x cores and around 165 GB of memory). I have also explicitly set the number of cores and executors to different values, but had the same issues. When looking at Ganglia, I don't see any excessive memory consumption.
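
For reference, the explicit overrides were along these lines in the session builder (the same keys can also be passed as spark-submit --conf flags; the numbers and app name below are placeholders, not the exact values tried):

import org.apache.spark.sql.SparkSession

// Explicit executor sizing instead of EMR's maximizeResourceAllocation.
// The values are placeholders; several combinations were tried.
val spark = SparkSession.builder()
  .appName("partitioned-parquet-write")
  .config("spark.executor.instances", "10")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "30g")
  .getOrCreate()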

So, it seems very likely that the root cause is the 2 GB ByteArrayBuffer limit that can be hit on shuffles. I then tried repartitioning the dataframe with various counts, such as 100, 500, 1000, 3000, 5000, and 10000, with no luck. The job occasionally logs a heap space error, but most of the time gives a node lost error. When looking at the individual node logs, the job just seems to fail suddenly with no indication of the problem (which isn't surprising for some OOM exceptions).
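
The repartition attempts were essentially the same write with an explicit count in front, e.g. (1000 being just one of the counts listed; df, the SaveMode import and the path variables are the same as above):

df.repartition(1000)
  .write.partitionBy("account", "markers")
  .mode(SaveMode.Overwrite)
  .parquet(s"$location$org/$corrId/")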

For dataframe writes, is there a trick with partitionBy to get past the heap space error?

  • have you tried `df.repartition($"account",$"markers").write.partitionBy("account", "markers")...` – Raphael Roth Nov 20 '17 at 06:55
  • Ah that is good to know, and strange. @raphael-roth Yeah, I actually tried repartitioning first, but ran into the same issue. I have even tried changing the order. I may create a ticket with spark, because I have documented my long list of trials. – Derek_M Nov 20 '17 at 12:30
  • I am experiencing exactly the same issue. Please update the thread if you have more findings. – botchniaque Nov 23 '17 at 12:08
  • @botchniaque I will, if I figure it out. It seems pretty strange. I posted this here https://issues.apache.org/jira/browse/SPARK-22584 with a substantially smaller dataset, and it was immediately closed. – Derek_M Nov 23 '17 at 15:08
  • We noticed that setting the `yarn.executor.memoryOverhead` to a value above the default (in your case you could try 3G) helps a lot with unexpected OOM errors. Regarding the `yarn.executor.memoryOverhead`, the default is 7% of `executor.memory`, but not less than 384M, so in your case for 16G of `executor.memory` the value used is around 1.1G. – botchniaque Dec 08 '17 at 16:35
  • Did you get this issue resolved? I am facing the same problem you are. – Anmol Deep Jan 17 '22 at 07:16
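
Putting the two suggestions from the comments together gives a sketch like the one below. It is only a sketch of what was proposed, not a confirmed fix; the 3g overhead is just the value floated above, and `spark` is assumed to be the active SparkSession.

// Submit-time setting suggested in the comments (illustrative value):
//   spark-submit --conf spark.yarn.executor.memoryOverhead=3g ...

import org.apache.spark.sql.SaveMode
import spark.implicits._   // for the $"column" syntax; assumes an active SparkSession named spark

// Repartition on the same columns used by partitionBy, so all rows for a
// given (account, markers) pair land in one task and each output directory
// is written by a single task instead of every task touching every directory.
df.repartition($"account", $"markers")
  .write.partitionBy("account", "markers")
  .mode(SaveMode.Overwrite)
  .parquet(s"$location$org/$corrId/")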
