
We're building a Spark application in Scala that uses a HOCON configuration file called application.conf.

If I add the application.conf to my jar file and start a job on Google Dataproc, it works correctly:

gcloud dataproc jobs submit spark \
  --cluster <clustername> \
  --jar=gs://<bucketname>/<filename>.jar \
  --region=<myregion> \
  -- \
  <some options>

Instead of bundling the application.conf into my jar file, I want to provide it separately, but I can't get that working.

I've tried several things, e.g. (a sketch of one of these attempts is shown after the list):

  1. Specifying the application.conf with --jars=gs://<bucketname>/application.conf (which should work according to this answer)
  2. Using --files=gs://<bucketname>/application.conf
  3. Same as 1 and 2, but with the application.conf in /tmp/ on the master instance of the cluster, then specifying the local file with file:///tmp/application.conf
  4. Defining extraClassPath for Spark using --properties=spark.driver.extraClassPath=gs://<bucketname>/application.conf (and likewise for the executors)
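
For example, attempt 2 was roughly the working submit command from above with the config passed via --files (same placeholders for bucket, cluster and region as before):

gcloud dataproc jobs submit spark \
  --cluster <clustername> \
  --jar=gs://<bucketname>/<filename>.jar \
  --files=gs://<bucketname>/application.conf \
  --region=<myregion> \
  -- \
  <some options>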

With all of these options I get the same error; it can't find the key in the config:

Exception in thread "main" com.typesafe.config.ConfigException$Missing: system properties: No configuration setting found for key 'xyz'

This error usually means that there's an error in the HOCON config (key xyz is not defined in HOCON) or that the application.conf is not on the classpath. Since the exact same config works when it's inside my jar file, I assume it's the latter.

Are there any other options to put the application.conf on the classpath?


1 Answer


If --jars doesn't work as suggested in this answer, you can try an init action: first upload your config to GCS, then write an init action that downloads it to the VMs and puts it into a folder that is on the classpath, or update spark-env.sh to include the path to the config.
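
A minimal sketch of such an init action, assuming the config is uploaded to gs://<bucketname>/application.conf; the script name, the folder /etc/spark/app-conf, and the use of spark-defaults.conf instead of spark-env.sh are my assumptions, not part of the answer:

#!/bin/bash
# Hypothetical init action (download-config.sh): runs on every node at cluster
# creation, downloads application.conf from GCS, and puts its folder on the
# Spark driver/executor classpath. Bucket and folder names are placeholders.
set -euxo pipefail

readonly CONFIG_DIR=/etc/spark/app-conf
mkdir -p "${CONFIG_DIR}"
gsutil cp gs://<bucketname>/application.conf "${CONFIG_DIR}/"

# This sketch appends to spark-defaults.conf rather than spark-env.sh.
# If extraClassPath is already set on your image, extend the existing value
# instead of adding a second entry.
cat >> /etc/spark/conf/spark-defaults.conf <<EOF
spark.driver.extraClassPath=${CONFIG_DIR}
spark.executor.extraClassPath=${CONFIG_DIR}
EOF

Upload the script to GCS and reference it when creating the cluster, for example:

gcloud dataproc clusters create <clustername> \
  --region=<myregion> \
  --initialization-actions=gs://<bucketname>/download-config.sh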

  • Thanks, that pointed me in the right direction! If I understand correctly, the init action is executed only once, when the cluster or node is created. We want to be able to change the application.conf frequently, though. Changing spark-env.sh through gcloud is also only possible when creating the cluster (through --parameter), but there we can define a folder where all configs will be saved later, which should work. To test this, I manually added spark.driver.extraClassPath and spark.executor.extraClassPath to /etc/spark/conf.dist/spark-defaults.conf. Now it works! – pgruetter Oct 09 '19 at 08:05
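
For completeness, a rough sketch of the manual test described in the comment; the folder /tmp/spark-configs is a hypothetical name (the comment doesn't say which folder was used):

# Run on the master node; /tmp/spark-configs is a hypothetical folder name.
CONFIG_DIR=/tmp/spark-configs
sudo mkdir -p "${CONFIG_DIR}"
gsutil cp gs://<bucketname>/application.conf "${CONFIG_DIR}/"

# Manually append the classpath entries, as described in the comment above.
cat <<EOF | sudo tee -a /etc/spark/conf.dist/spark-defaults.conf
spark.driver.extraClassPath=${CONFIG_DIR}
spark.executor.extraClassPath=${CONFIG_DIR}
EOF

Note that spark.executor.extraClassPath refers to a node-local path on each worker, so the config would have to be copied to the workers as well if the executors (and not only the driver) read it.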