
Below is my spark-submit command:

/usr/bin/spark-submit  \
  --class "<class_name>" \
  --master yarn \
  --queue default \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-DENVIRONMENT=pt -Dhttp.proxyHost=<proxy_ip> -Dhttp.proxyPort=8080 -Dhttps.proxyHost=<proxy_ip> -Dhttps.proxyPort=8080" \
  --conf "spark.executor.extraJavaOptions=-DENVIRONMENT=pt -Dhttp.proxyHost=<proxy_ip> -Dhttp.proxyPort=8080 -Dhttps.proxyHost=<proxy_ip> -Dhttps.proxyPort=8080" \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 \
  --driver-memory 3G \
  --executor-memory 4G \
  --num-executors 2 \
  --executor-cores 3 <jar_file>

The spark-submit command times out while resolving the package dependency.

Replacing --packages with --jars works, but I would like to get to the bottom of why --packages is not working for me. Also, for http.proxyHost and https.proxyHost, is it correct to specify only the IP address, without http:// or https://?

Edit

Please note the following:

  • The machine I am deploying from and the Spark cluster are both behind an HTTP proxy.
  • I know the difference between --jars and --packages. I want to get the --packages option to work in my case.
  • I have tested the HTTP proxy settings on my machine. I can reach the internet from it, and curl works. For some reason it looks as if spark-submit is not picking up the HTTP proxy settings.
halfer
Abdul Rahman

1 Answer


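In a nutshell, the difference between --packages and --jars is this: --packages takes Maven coordinates and resolves them, together with their transitive dependencies, from Maven repositories at submit time, while --jars takes a list of jar files you already have and adds them to the driver and executor classpaths. With --jars you have to supply every required jar yourself. With --packages, the machine that performs the resolution needs network access to the repositories. Spark does the resolution with its bundled Apache Ivy, so a separate Maven installation on the nodes is not required.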
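Since the resolution needs network access, the proxy settings have to reach the JVM that actually performs it. As far as I understand, with --master yarn and --deploy-mode cluster that is the JVM spark-submit starts on the submitting machine, not the driver or executor JVMs that spark.driver.extraJavaOptions and spark.executor.extraJavaOptions configure. A minimal sketch, assuming the SPARK_SUBMIT_OPTS environment variable is honoured by your Spark launcher and using a hypothetical proxy host name (not taken from the question):

```shell
# Hypothetical placeholder values; substitute your real proxy host and port.
PROXY_HOST="proxy.example.internal"
PROXY_PORT="8080"

# The Java proxy properties take a bare host name or IP address,
# without an http:// or https:// prefix.
PROXY_OPTS="-Dhttp.proxyHost=${PROXY_HOST} -Dhttp.proxyPort=${PROXY_PORT}"
PROXY_OPTS="${PROXY_OPTS} -Dhttps.proxyHost=${PROXY_HOST} -Dhttps.proxyPort=${PROXY_PORT}"

# Assumption: SPARK_SUBMIT_OPTS is appended to the options of the JVM
# that spark-submit launches locally, which is where --packages is resolved.
export SPARK_SUBMIT_OPTS="${PROXY_OPTS}"
echo "SPARK_SUBMIT_OPTS=${SPARK_SUBMIT_OPTS}"

# Afterwards, run the original spark-submit command unchanged
# in the same shell session.
```

If the timeout disappears with this variable set, the proxy options in extraJavaOptions were simply being applied to the wrong JVM. They may still be needed there if the job itself makes outbound HTTP calls.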

More detailed information can be found in the spark-submit help:

--jars JARS Comma-separated list of jars to include on the driver and executor classpaths.

--packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version.
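To check whether the repository is reachable at all, it can help to translate the coordinate into the URL of the jar and fetch it by hand. The sketch below assumes the standard Maven repository layout and the public Maven Central base URL; your Spark build or a --repositories setting may point somewhere else.

```shell
# Coordinate from the question, in groupId:artifactId:version form.
COORD="org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0"

# Split the coordinate on ':' into its three parts.
IFS=':' read -r GROUP_ID ARTIFACT_ID VERSION <<EOF
${COORD}
EOF

# In the Maven layout, the dots in the groupId become path separators.
GROUP_PATH=$(printf '%s' "${GROUP_ID}" | tr '.' '/')

# Assumption: Maven Central base URL; adjust for a mirror or private repository.
BASE_URL="https://repo1.maven.org/maven2"
JAR_URL="${BASE_URL}/${GROUP_PATH}/${ARTIFACT_ID}/${VERSION}/${ARTIFACT_ID}-${VERSION}.jar"
echo "${JAR_URL}"

# To test through the proxy, something like this could be run by hand:
# curl -sSI -x "http://<proxy_ip>:8080" "${JAR_URL}"
```

If curl can fetch the headers for that URL through the proxy but spark-submit still times out, the problem is most likely that the resolving JVM does not see the proxy settings, rather than the network itself.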

Martijn Pieters
  • My question was why the `--packages` option is not working for me. Also note that the machine I am deploying from and the cluster are behind an HTTP proxy. – Abdul Rahman Nov 13 '19 at 11:36