I'd like to execute a Spark job via an HTTP call from outside the cluster, using Livy, where the Spark jar already exists in HDFS.
I'm able to spark-submit the job from a shell on the cluster nodes, e.g.:
spark-submit --class io.woolford.Main --master yarn-cluster hdfs://hadoop01:8020/path/to/spark-job.jar
Note that --master yarn-cluster is necessary to access HDFS, where the jar resides.
I'm also able to submit commands via Livy using curl. For example, this request:
curl -X POST --data '{"file": "/path/to/spark-job.jar", "className": "io.woolford.Main"}' -H "Content-Type: application/json" hadoop01:8998/batches
... executes the following command on the cluster:
spark-submit --class io.woolford.Main hdfs://hadoop01:8020/path/to/spark-job.jar
This is the same as the command that works, minus the --master yarn-cluster option. I verified this by tailing /var/log/livy/livy-livy-server.out.
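For anyone reproducing this, the generated spark-submit command shows up in that log as the batch is submitted, so it can be watched with, e.g.:

tail -f /var/log/livy/livy-livy-server.out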
So, I just need to modify the curl command so that --master yarn-cluster is included when Livy executes the job. At first glance, it seemed like this should be possible by adding arguments to the JSON dictionary. Unfortunately, these aren't passed through.
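For illustration, an attempt along the lines of the request below, with an extra "args" entry (just one example of the kind of addition I mean; as far as I can tell, Livy treats "args" as arguments for the application itself rather than for spark-submit), still doesn't get --master yarn-cluster onto the spark-submit command:

curl -X POST --data '{"file": "/path/to/spark-job.jar", "className": "io.woolford.Main", "args": ["--master", "yarn-cluster"]}' -H "Content-Type: application/json" hadoop01:8998/batches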
Does anyone know how to pass --master yarn-cluster to Livy so that jobs are executed on YARN, without making system-wide changes?