
I am submitting a Spark job to Livy through an AWS Lambda function. The job runs to the end of the driver program but then does not shut down.

If spark.stop() or sc.stop() is added to the end of the driver program, the Spark job finishes on the YARN resource manager and Livy reports a success. However, there is still a Livy process running on the master node which takes around 1.5 GB of memory. If many jobs are submitted, this eventually uses and holds all of the master node's memory.

The job:

  • Pulls records from a Hive table

  • Collects these records on the master node and then writes them to a PDF file using Apache PDFBox

  • Uploads the resulting PDF to S3
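The steps above can be sketched roughly as follows. This is a non-runnable illustration in PySpark, not the asker's actual code: the table name, bucket, and paths are hypothetical, and write_pdf() is a placeholder for the PDF-rendering step (the question uses Apache PDFBox, a Java library, so the real driver is likely Scala or Java).

```python
from pyspark.sql import SparkSession
import boto3

spark = SparkSession.builder.appName("report-job").enableHiveSupport().getOrCreate()

try:
    # 1. Pull records from a Hive table (table name is hypothetical)
    rows = spark.sql("SELECT * FROM reports.records").collect()  # 2. collect on the driver

    # 3. Render the collected rows to a PDF -- placeholder for the PDFBox step
    write_pdf(rows, "/tmp/report.pdf")

    # 4. Upload the resulting PDF to S3 (bucket/key are hypothetical)
    boto3.client("s3").upload_file("/tmp/report.pdf", "my-bucket", "reports/report.pdf")
finally:
    # Without an explicit stop, the question reports the job never finishes.
    spark.stop()
```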

Running spark-submit directly on the cluster produces the same result; however, if I press Ctrl+C while the spark-submit job is running, the process on the master node is ended.

We are expecting the job to finish by itself when it reaches the end of the driver program. Failing that, the shutdown hook should be called when spark.stop() is called.

Kieran
  • I'm sure you have either found a solution or moved on, but I had a similar issue. In my case the Spark job was writing to an AWS queue, which returned a Java future, and the future had not completed before the job terminated (context.stop() etc.), so the driver program hung. Once I changed to a blocking call, everything worked as expected. Hope this gives you some clues. – Andersondk7 Nov 15 '19 at 15:33
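The commenter's case involved a Java future; the same pattern can be sketched in Python with concurrent.futures (send_to_queue is a hypothetical stand-in for an async AWS SDK call). If the driver reaches its shutdown path before the future completes, the in-flight work can keep the process hanging; blocking on the result before shutdown avoids that.

```python
import concurrent.futures
import time

def send_to_queue(msg):
    # Stand-in for an asynchronous AWS SDK call that returns a future.
    time.sleep(0.5)
    return "sent: " + msg

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = executor.submit(send_to_queue, "report.pdf")

# Non-blocking: reaching spark.stop() here, with the future still pending,
# matches the commenter's hang. Blocking with .result() waits for the call
# to finish before shutdown proceeds.
result = future.result()   # blocks until the call completes
executor.shutdown()
print(result)
```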

1 Answer


Have you tried enabling this flag in the Spark configuration? spark.yarn.submit.waitAppCompletion=false

What I observed is that Livy runs a spark-submit command, and the above flag makes sure that the command returns as soon as the YARN application is assigned an applicationId, rather than waiting for the application to complete.
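With Livy's batch API, the flag can be passed in the conf map of the submission payload (file path and payload values here are hypothetical):

```json
{
  "file": "s3://my-bucket/jobs/report-job.py",
  "conf": {
    "spark.yarn.submit.waitAppCompletion": "false"
  }
}
```

Equivalently, for a direct submission: spark-submit --conf spark.yarn.submit.waitAppCompletion=false ... In YARN cluster mode this makes the client process exit once the application is accepted, instead of lingering until it finishes.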

chendu