
I have a Spark job which I'm running using the following command:

sudo ./bin/spark-submit --jars lib/spark-streaming-kafka-assembly_2.10-1.4.1.jar \
--packages TargetHolding:pyspark-cassandra:0.2.4 \
examples/src/main/python/final/kafka-sparkstreaming-cassandra.py

However, sometimes the job fails after running continuously for, say, 2 days, and I then have to restart it manually.

This is highly inefficient for my purpose, because I am continuously reading data from Kafka and saving it to Cassandra.

What features does Spark have for this kind of fault tolerance? Maybe spark-submit could simply be launched again? Maybe there's something smarter? I tried to Google this, but there is very little information about it.
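
As a crude workaround, I suppose I could wrap spark-submit in a restart loop, roughly like the sketch below (same jars and script as in the command above; the 30-second back-off is just an arbitrary value I picked):

#!/bin/bash
# Rough sketch: re-launch the job whenever spark-submit exits.
while true; do
  sudo ./bin/spark-submit \
    --jars lib/spark-streaming-kafka-assembly_2.10-1.4.1.jar \
    --packages TargetHolding:pyspark-cassandra:0.2.4 \
    examples/src/main/python/final/kafka-sparkstreaming-cassandra.py
  echo "spark-submit exited with code $?; restarting in 30 seconds..." >&2
  sleep 30
done

But that feels like a hack rather than a proper solution, which is why I'm hoping Spark has something built in.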

P.S. - I am using Spark 1.4.1.

I hope to receive some good ideas!

Thanks!

HackCode
  • There is an application restart parameter at the YARN level, and a configurable Spark parameter for task retries. The main question is why your job is constantly failing; the fact that you have fault tolerance doesn't mean that you will necessarily have a 24/7 application. Focus on why your job is failing and how you can fix it. If you experience some fundamental problem (Cassandra is down / reaching node resource limits ...), what do you expect Spark (or any other tool, for that matter) to do? – Michael Kopaniov Mar 21 '16 at 10:32
  • Usually a job can fail because of out-of-memory errors or other unhandled exceptions. You should look into the Spark UI and your logs, and/or configure the Spark history server so you can inspect the UI after the job is dead. In your case it would probably help to collect logs or application metrics in real time to ES/Kibana or Splunk. That will give you more insight and let you fix the issue. If you are running under YARN, there is a limit on the number of consecutive task failures allowed before the entire job is considered failed (4 by default, if I remember correctly). – PinoSan Mar 21 '16 at 23:10
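
For reference, assuming the job were submitted to YARN (the command in the question does not use a YARN master, so this is only a sketch), the settings the two comments above appear to refer to are spark.yarn.maxAppAttempts (how many times YARN will re-attempt the application, if your Spark/YARN version supports it) and spark.task.maxFailures (task retries before the job is considered failed, 4 by default). The values below are illustrative, not recommendations:

# Sketch only: YARN cluster mode with illustrative retry settings.
sudo ./bin/spark-submit --master yarn-cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.task.maxFailures=8 \
  --jars lib/spark-streaming-kafka-assembly_2.10-1.4.1.jar \
  --packages TargetHolding:pyspark-cassandra:0.2.4 \
  examples/src/main/python/final/kafka-sparkstreaming-cassandra.py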

0 Answers