I'm using Spark on a cluster in Standalone mode.
I'm currently working on a Spark Streaming application. I've added checkpointing so the system can recover from the master process failing suddenly, and I can see that it works well.
My question is: what happens if an entire node crashes (power failure, hardware error, etc.)? Is there a way to automatically detect failed nodes in the cluster and, if so, restart them on the same machine (or on a different machine instead)?
I've looked at monit, but it seems to run on a specific machine and restart failing processes there, whereas I need the same thing across nodes. To be clear, I don't mind if the restart operation takes a little time, but I would prefer that it happen automatically.
Is there any way to do this?
Thanks in advance