
I am facing a java.lang.OutOfMemoryError: Java Heap Space error every second time I run the same Spark program.

Here is a scenario:

When I do the spark-submit and run the Spark program for the first time, it gives me the correct output and everything is fine. When I execute the same spark-submit a second time, it throws a java.lang.OutOfMemoryError: Java Heap Space exception.

When does it work again?

If I run the same spark-submit after clearing the Linux page cache by writing to /proc/sys/vm/drop_caches (e.g. `echo 3 > /proc/sys/vm/drop_caches` as root), it runs successfully again, but only once.

I have tried tuning all the usual Spark memory settings, such as memoryOverhead, driver-memory, executor-memory, etc.

Any idea what's happening here? Is this really a problem with the Spark code, or is it happening because of some Linux machine setting or the way the cluster is configured?

Thanks.

Saurabh Deshpande
  • It should mainly depend on the JRE's version and implementation – QuickSilver Jun 18 '20 at 04:27
  • I believe it's to do with the underlying JVM settings. The resources should be cleaned up after the Spark job dies. – Constantine Jun 18 '20 at 08:18
  • @QuickSilver - Thanks. Which JRE version/implementation should we go with? Can you please give some more clarity, with some numbers? That would help a lot – Saurabh Deshpande Jun 18 '20 at 23:17
  • @Constantine - Thanks for your reply. When we say JVM settings, where should we make the change, and can you tell which exact property we need to configure in order to achieve this: `resources should be cleaned up after spark job dies` – Saurabh Deshpande Jun 18 '20 at 23:19

1 Answer


If you are using df.persist() or df.cache(), then you should also be calling the df.unpersist() method once you are done with that DataFrame. There is also sqlContext.clearCache(), which clears all cached data at once.
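A minimal Scala sketch of that pattern (the input/output paths and DataFrame name here are placeholders, not taken from your job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("CacheCleanupExample").getOrCreate()

// Placeholder input path -- substitute your own data source.
val df = spark.read.parquet("/data/input")

// Cache the DataFrame while it is reused across several actions.
df.persist(StorageLevel.MEMORY_AND_DISK)
val total = df.count()
df.write.mode("overwrite").parquet("/data/output")

// Release the cached blocks as soon as the DataFrame is no longer needed.
df.unpersist()

// Or drop every cached table/DataFrame in one go before the job exits
// (spark.catalog.clearCache() is the newer equivalent of sqlContext.clearCache()).
spark.catalog.clearCache()

spark.stop()
```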