
I have a typical batch job that reads CSV from cloud storage, then does a bunch of joins and aggregates; the whole file does not exceed 3 GB. But I keep getting an OOM exception when writing the result back to storage. I have two executors, each with 80 GB of RAM, so it just doesn't make sense. Here is a screenshot of my Spark UI and the exception. Any suggestion is appreciated. If my code is super sub-optimal in terms of memory, why doesn't it show up on the Spark UI?


Update: the source code is too convoluted to show here, but I figured out that the essential cause is the multiple joins.

Dataset<Row> ret = ...; // some initial dataframe
for (String cmd : cmds) {
    ret = ret.join(processDataset(ret, cmd), "primary_key");
}

So each processDataset(ret, cmd), if you run it on its own, is very fast, but if you join in a for loop like this many times, say 10 or 20 times, it gets much, much slower and runs into this OOM issue.
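For context, a common remedy for this pattern (a minimal sketch, not the original code; the checkpoint directory, the input path, and the every-5-iterations cadence are placeholders) is to break the lineage periodically. Each iteration's join embeds the full plan of all previous iterations, so the logical plan roughly doubles per loop, and both the optimizer and the executors pay for it. Dataset.checkpoint() materializes the intermediate result and truncates the plan:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("join-loop").getOrCreate();
// Checkpoint data should go to reliable storage; this path is a placeholder
spark.sparkContext().setCheckpointDir("/tmp/spark-checkpoints");

// Placeholder read; cmds and processDataset are the question's own helpers
Dataset<Row> ret = spark.read().option("header", "true").csv("s3a://bucket/input.csv");
int i = 0;
for (String cmd : cmds) {
    ret = ret.join(processDataset(ret, cmd), "primary_key");
    if (++i % 5 == 0) {
        // checkpoint() writes the current result out and truncates the
        // logical plan, so later joins start from a flat lineage instead
        // of one that grows with every iteration
        ret = ret.checkpoint();
    }
}

A plain persist() would cache the data but not truncate the logical plan, so checkpointing (or localCheckpoint()) is the more thorough fix when the plan itself is the problem.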


1 Answer


When I have problems with memory I check these things:

  • Have more executors (more than 2; set via --total-executor-cores in spark-submit and spark.executor.cores in the SparkSession)
  • Have fewer cores per executor (3-5). You have 14, which is much more than recommended (spark.executor.cores); see the sketch after this list
  • Add memory to the executors (spark.executor.memory)
  • Add memory to the driver (--driver-memory in the spark-submit script)
  • Make more partitions, i.e. make each partition smaller in size (.config("spark.sql.shuffle.partitions", numPartitionsShuffle) in the SparkSession)
  • Look at the PeakExecutionMemory of tasks in the Stages tab (one of the additional metrics you can turn on) to see if it is not too big
  • If you use Mesos, the Agents tab shows the real memory usage per driver and executor (see this answer: How to get Mesos Agents Framework Executor Memory)
  • Call explain on your code to analyze the execution plan
  • Check whether one of your joins explodes your memory by producing many duplicate rows
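As a concrete illustration of the configuration points above (a sketch; every number is a placeholder to tune for your cluster, not a recommendation for this specific job):

import org.apache.spark.sql.SparkSession;

// Executor and shuffle settings can be set here, because executors are
// launched after this code runs. Driver memory and total cores must be
// passed to spark-submit instead, since the driver JVM is already running:
//   spark-submit --driver-memory 8g --total-executor-cores 8 app.jar
SparkSession spark = SparkSession.builder()
        .appName("csv-join-aggregate")
        .config("spark.executor.cores", "4")           // 3-5 cores per executor
        .config("spark.executor.memory", "16g")        // memory per executor
        .config("spark.sql.shuffle.partitions", "400") // more, smaller partitions
        .getOrCreate();

To check the last two points, call explain(true) on the final Dataset: it prints the parsed, analyzed, optimized, and physical plans, and a join whose output is far larger than its inputs is the usual sign of duplicate keys multiplying rows.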
– astro_asz
  • hey, thanks for the great answer, I updated the question with some sample code, and I think it's the joins you mentioned in your answer, thanks again! – dex Jan 26 '19 at 02:01