Our Spark executor logs contained errors like this:
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
Since these are heartbeats from the executors to the driver, I suspected GC issues on the driver, so I enabled GC logging there and found entries like this:
[Full GC (System.gc()) 5402.271: [CMS: 10188280K->8448710K(14849412K),27.2815605 secs] 10780958K->8448710K(15462852K), [Metaspace: 93432K->93432K(96256K)], 27.2833999 secs] [Times: user=27.28 sys=0.01, real=27.29 secs]
Evidently, something is calling System.gc(), causing long GC pauses on the driver (27 seconds in this case). Looking further, RMI is a suspect, as these System.gc() calls take place exactly every 30 minutes.
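For anyone who wants to check the spacing on their own driver, here is a rough sketch of how the explicit collections can be pulled out of the GC log. It assumes the log lines look like the one quoted above; the function names `explicit_gc_events` and `gaps_between` are made up for illustration and are not part of Spark or the JVM.

```python
import re

# Matches explicit full collections in the log format quoted above.
# Group 1: JVM uptime in seconds. Group 2: wall-clock ("real") pause in seconds.
EXPLICIT_GC = re.compile(
    r"Full GC \(System\.gc\(\)\)\s+(\d+\.\d+):.*?real=(\d+\.\d+) secs"
)


def explicit_gc_events(lines):
    """Return (uptime_seconds, pause_seconds) for each explicit Full GC line."""
    events = []
    for line in lines:
        match = EXPLICIT_GC.search(line)
        if match:
            events.append((float(match.group(1)), float(match.group(2))))
    return events


def gaps_between(events):
    """Return the seconds elapsed between consecutive explicit Full GCs."""
    uptimes = [uptime for uptime, _pause in events]
    return [later - earlier for earlier, later in zip(uptimes, uptimes[1:])]


# The single log line from this post; a real run would pass every line of the log.
sample = [
    "[Full GC (System.gc()) 5402.271: [CMS: 10188280K->8448710K(14849412K),"
    "27.2815605 secs] 10780958K->8448710K(15462852K), "
    "[Metaspace: 93432K->93432K(96256K)], 27.2833999 secs] "
    "[Times: user=27.28 sys=0.01, real=27.29 secs]"
]

events = explicit_gc_events(sample)
print(events)                # uptime and pause of each explicit Full GC
print(gaps_between(events))  # empty here, since only one event is present
```

With the full log as input, gaps close to 1800 seconds would match the 30-minute pattern described above.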
I couldn't find any reference to this RMI issue on the Spark driver. Should I go ahead and disable explicit System.gc() calls by setting -XX:+DisableExplicitGC?