I am running an AWS EMR cluster (emr-5.30.1, Spark 2.4.5, Livy 0.7.0). My service passes jobs to Livy, and Livy executes "spark-submit" in cluster mode to submit them to YARN. The master node is an 8-core, 16 GB machine.
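For reference, submission from my service looks roughly like this (a simplified sketch; the endpoint, jar path, and class name below are placeholders, not my actual values):

```python
import requests

LIVY_URL = "http://<emr-master-dns>:8998"  # placeholder for the EMR master's Livy endpoint

def submit_batch(jar_path, class_name, args):
    """Submit one job as a Livy batch; Livy then runs spark-submit in cluster mode."""
    payload = {
        "file": jar_path,          # e.g. an S3 path to the application jar
        "className": class_name,
        "args": args,
    }
    resp = requests.post(f"{LIVY_URL}/batches", json=payload,
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()["id"]       # Livy batch id, useful for polling the batch state later
```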
I see dead jobs when ~15-20 jobs are submitted to Livy at once. The Livy logs show "spark-submit exited with code 143" (i.e. terminated by SIGTERM), which suggests the process was killed by the kernel or an OOM handler. I am not able to find any further logs for the killed processes. Monitoring the master node while these jobs are submitted shows ~100% CPU and ~80% memory usage.
I tried using a 32 GB master node. It can handle 15-20 jobs submitted in parallel, but fails when the number of parallel jobs goes above ~30.
To solve this, I am thinking of putting a queue in my service and passing jobs to Livy gradually (one job every 8-10 seconds), as sketched below. I am reluctant to add a queue because it would need to be a distributed one.
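A rough sketch of the throttling idea, reusing the submit_batch helper from above (in-process only; the 8-second interval is illustrative, and a real version would have to be distributed, which is exactly my concern):

```python
import queue
import threading
import time

submission_queue = queue.Queue()   # in-process queue; a real deployment would need a distributed one
SUBMIT_INTERVAL_SECONDS = 8        # illustrative pacing: roughly one job every 8-10 seconds

def submitter_loop():
    """Drain the queue slowly so the master node never runs many spark-submit processes at once."""
    while True:
        job = submission_queue.get()          # blocks until a job is available
        try:
            batch_id = submit_batch(job["file"], job["className"], job["args"])
            print(f"submitted batch {batch_id}")
        finally:
            submission_queue.task_done()
        time.sleep(SUBMIT_INTERVAL_SECONDS)   # throttle: space out spark-submit launches

threading.Thread(target=submitter_loop, daemon=True).start()
```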
I have a few questions here:
- This seems to be an insufficient-memory problem, but I don't see explicit logs confirming it. Can I conclude that this is a memory error?
- What alternative solutions/approaches could be used to fix this?