I would like to know why ParallelGC (--conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC") runs faster in a very long Spark ML Pipeline than G1GC (--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"), even though the Spark community suggests that G1GC is much better than ParallelGC.

Any pointers on this would help.

Aakash Basu

1 Answer

If you want to know how it works in your case, you need to run a few experiments and collect data on JVM performance with each set of options. This is necessary because nobody but you knows your exact case, your environment, and the data load of your application.

You need to profile the JVMs in your cluster with debug flags enabled that log each GC action and its duration, and then correlate that data with load metrics from the application run. You can also use a visual profiler (VisualVM, Mission Control, etc.) to watch GC metrics as real-time graphs.
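
As a rough sketch of such an experiment, the same job could be submitted once per collector with GC logging switched on. This assumes the executors run on JDK 8, where the flags below apply (on JDK 9 and later, unified logging such as -Xlog:gc* takes their place). The jar name is a placeholder, and your real command will carry whatever other options the job needs.

```shell
# Flags that log each collection and its timing (JDK 8 syntax).
# Assumption: on JDK 9+ use unified logging instead, e.g. -Xlog:gc*
GC_LOG="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

# Run 1: ParallelGC with GC logging
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC $GC_LOG" \
  your-app.jar   # placeholder for your application

# Run 2: G1GC with the same logging, same data, same cluster
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC $GC_LOG" \
  your-app.jar   # placeholder for your application
```

The GC output ends up in the stdout of each executor on the worker nodes, not in the driver output. The Executors tab of the Spark UI also shows a GC Time column, which gives a quick first comparison between the two runs before you dig into the logs.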

gemelen