2

Our Spark executors logs had these:

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval

Figuring out that these are heartbeats from executors to the driver, I suspected GC issus on the driver, thus enabled GC logging, and found these:

[Full GC (System.gc()) 5402.271: [CMS: 10188280K->8448710K(14849412K),27.2815605 secs] 10780958K->8448710K(15462852K), [Metaspace: 93432K->93432K(96256K)], 27.2833999 secs] [Times: user=27.28 sys=0.01, real=27.29 secs]

Evidently, something calls System.gc(), causing long GC pauses like this on the driver (27 seconds). Looking further, RMI is a suspect, as these System.gc() calls take place every 30 minutes exactly. I couldn't find any reference to this issue with RMI on Spark driver. Should i go ahead and disable System.gc() calls by setting -XX:+DisableExplicitGC?

Daniel Nitzan
  • 1,582
  • 3
  • 19
  • 36

2 Answers2

1

It's funny I was just looking into a similar issue. I can see that some of the code in Spark does in fact invoke System.gc().

Might be worthwhile to open up a JIRA in Spark to discuss this.

I understand that it's not best practice to make a call using System.gc(), and mainly because it will stop all other threads which has a significant performance impact. However, I can see in the Java Oracle documentation that an additional JVM argument was introduced starting with Java 1.6 in order to run System.gc() concurrently (-XX:+ExplicitGCInvokesConcurrent):

http://docs.oracle.com/javase/6/docs/technotes/guides/vm/cms-6.html

You can perhaps try to set that as an additional parameter:

spark.executor.extraJavaOptions="-XX:+ExplicitGCInvokesConcurrent"

Depending on how you're setting the parameter you would either put this in your Spark's configuration file, or pass it using the --conf line argument in your spark command (spark-submit, spark-shell, etc...).

Update:

Found the following comment in the ContextCleaner.scala file for Spark 2.x:

/**
* How often to trigger a garbage collection in this JVM.
*
* This context cleaner triggers cleanups only when weak references are  garbage collected.
* In long-running applications with large driver JVMs, where there is little memory pressure
* on the driver, this may happen very occasionally or not at all. Not  cleaning at all may
* lead to executors running out of disk space after a while.
*/
Tshilidzi Mudau
  • 7,373
  • 6
  • 36
  • 49
wFateem
  • 346
  • 1
  • 2
  • 11
  • What `System.gc()` will actually do, if anything, in terms of actual garbage collection, and how, is undefined. The documentation says nothing more than 'provides a hint'. – user207421 Aug 12 '17 at 23:08
  • Are you referring to the link I provided? It's there just for reference about the -XX:+ExplicitGCInvokesConcurrent parameter, and that would be useful assuming that the System.gc() call was honored and that the JVM actually kicked off a GC. If it an explicit GC actually ran it can definitely have a performance impact since it will pause all threads in the JVM. – wFateem Aug 13 '17 at 00:27
0

https://github.com/apache/spark/blob/branch-3.3/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L92

  /**
   * How often to trigger a garbage collection in this JVM.
   *
   * This context cleaner triggers cleanups only when weak references are garbage collected.
   * In long-running applications with large driver JVMs, where there is little memory pressure
   * on the driver, this may happen very occasionally or not at all. Not cleaning at all may
   * lead to executors running out of disk space after a while.
   */
  private val periodicGCInterval = sc.conf.get(CLEANER_PERIODIC_GC_INTERVAL)

and document https://spark.apache.org/docs/latest/configuration.html#memory-management

spark.cleaner.periodicGC.interval   30min   Controls how often to trigger a garbage collection.

This context cleaner triggers cleanups only when weak references are garbage collected. In long-running applications with large driver JVMs, where there is little memory pressure on the driver, this may happen very occasionally or not at all. Not cleaning at all may lead to executors running out of disk space after a while.
Dyno Fu
  • 8,753
  • 4
  • 39
  • 64