
I have a PySpark job running via Qubole that fails with the following error:

Qubole > Shell Command failed, exit code unknown
Qubole > 2016-12-03 17:36:53,097 ERROR shellcli.py:231 - run - Retrying exception reading mapper output: (22, 'The requested URL returned error: 404 Not Found')

Qubole > 2016-12-03 17:36:53,358 ERROR shellcli.py:262 - run - Retrying exception reading mapper logs: (22, 'The requested URL returned error: 404 Not Found')

The job is run with the following configuration:

--num-executors 38 --executor-cores 2 --executor-memory 12288M --driver-memory 4000M --conf spark.storage.memoryFraction=0.3 --conf spark.yarn.executor.memoryOverhead=1024
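
For context, these flags would normally be attached to a spark-submit invocation along the lines below; the yarn-client master/deploy-mode and the script name job.py are assumptions for illustration, not details taken from the question:

spark-submit --master yarn --deploy-mode client --num-executors 38 --executor-cores 2 --executor-memory 12288M --driver-memory 4000M --conf spark.storage.memoryFraction=0.3 --conf spark.yarn.executor.memoryOverhead=1024 job.py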

The cluster has 30 slave nodes; both the master and the slaves are m2.2xlarge instances with 4 cores each.

Any insight into the root cause of the issue would be appreciated.

1 Answer


In many cases the above error is not the real cause of the failure. On Qubole, the Spark job is submitted via a ShellCli: a single-mapper command that invokes the main PySpark job using spark-submit on one of the slave nodes. Since that same ShellCli process also hosts the driver in yarn-client mode, you can hit this error whenever the process dies for any reason (for example, driver memory problems). A less likely cause is a network connectivity issue in which the Qubole tier cannot reach the process or slave node where the single-mapper invoker job is running.
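
If the driver on that node really is running out of memory, one minimal thing to try is resubmitting with a larger driver heap; the 8000M figure below is only an illustrative assumption, not a value from the answer:

spark-submit --master yarn --deploy-mode client --num-executors 38 --executor-cores 2 --executor-memory 12288M --driver-memory 8000M --conf spark.storage.memoryFraction=0.3 --conf spark.yarn.executor.memoryOverhead=1024 job.py

Following the reasoning above, the 404s reported by shellcli.py are just the Qubole tier failing to fetch output from a mapper/driver process that has already gone away, so the driver and YARN logs on that slave node are the more useful place to look for the actual failure.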

Ashish Dubey