
I came across an interesting question about the different methods of submitting a Spark application from a Windows development environment. Generally, we submit a Spark job using spark-submit, but we can also execute an uber jar (with the dependent Spark libraries assembled into the jar) using java -jar:

  • Command using `java -jar`: `java -jar -Xmx1024m /home/myuser/myhar.jar`
  • Command using `spark-submit`: `spark-submit --master local[*] /home/myuser/myhar.jar`

Since I can execute the job using both methods, I observed that sometimes the java -jar method is faster and sometimes spark-submit is faster for the same data-set (say 20,000 rows with lots of data-shuffling logic inside). spark-submit offers better options to control executors, memory, etc. via command-line arguments, whereas with the java -jar method we need to hard-code those settings inside the code itself. If we run the jar with a large data-set, java -jar throws an out-of-memory exception, while spark-submit takes longer but executes without error using the default configuration.
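For illustration, a minimal sketch of what I mean by hard-coding the settings inside the code (the app name, shuffle-partitions value, and input path below are made-up examples, not my actual job):

```scala
import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    // When the uber jar is launched with plain `java -jar`, settings such as the
    // master have to be baked in here (the heap size still comes from -Xmx):
    val spark = SparkSession.builder()
      .appName("MyApp")                            // hypothetical app name
      .master("local[*]")                          // hard-coded instead of --master local[*]
      .config("spark.sql.shuffle.partitions", "8") // example tuning value, also hard-coded
      .getOrCreate()

    // With spark-submit, the same values could instead come from the command line:
    //   spark-submit --master local[*] --conf spark.sql.shuffle.partitions=8 myhar.jar
    spark.read.textFile("/home/myuser/input.txt").show() // hypothetical input path
    spark.stop()
  }
}
```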

I couldn't understand the difference between submitting the application using spark-submit and using java -jar, hence my question is:

How does the execution happen when we submit the application using java -jar? Does it execute inside the JVM memory itself, not using any Spark properties?

Sandeep Singh
  • It depends upon the data; if we run Spark on small chunks of data or small files, it might take longer than a normal Java application. Spark is designed for huge data processing... On how much data (what size) have you tested? – Giri Feb 12 '20 at 10:46
  • Hi Giri, thanks for your comment. Yes, Spark is designed for large data sets as well as small data sets. Here I am mostly looking for the significant difference between using `java -jar` and `spark-submit`! – Sandeep Singh Feb 12 '20 at 16:08
  • In short, both options are equivalent. `spark-submit` is a thin wrapper around `spark-class` (https://github.com/apache/spark/blob/master/bin/spark-class) which in turn calls `java -jar ...` with certain JVM options, classpath, etc. usually sourced from files like `load-spark-env.sh` and `spark-defaults.conf`. If you're noticing any difference in performance, you should be able to set values in your Spark configuration files or on the command line to bring the two into line. – Charlie Flowers Feb 12 '20 at 17:02
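
To make the last comment concrete, a rough sketch of how the two invocations could be brought into line (the memory value is only an example, and this assumes the code does not hard-code conflicting settings; a default SparkConf picks up JVM system properties prefixed with spark.):

```
# launcher-script form: spark-class supplies the JVM options and classpath for you
spark-submit --master local[*] --driver-memory 2g /home/myuser/myhar.jar

# roughly equivalent plain-JVM form: pass the same values yourself
java -Xmx2g -Dspark.master="local[*]" -jar /home/myuser/myhar.jar
```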

0 Answers