It seems like I have to configure a `cluster_resolver` before running training to enable distributed training across multiple workers.
But how does that work with autoscaling and node failures?
https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy
For reference, I am running this on Databricks.
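
For context, here is a minimal sketch of the static setup I have in mind, using `TFConfigClusterResolver` (the hostnames, port, and `TF_CONFIG` values are placeholders I made up). This is exactly the part that confuses me: the worker list is fixed up front, so I don't see how it stays valid when autoscaling adds or removes nodes.

```python
import json
import os

import tensorflow as tf

# Static two-worker cluster spec. With autoscaling, this hardcoded
# worker list seems like the problem: a node that joins or dies later
# isn't reflected here. (Hosts and port are placeholders.)
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker0.example.com:12345", "worker1.example.com:12345"],
    },
    # Each worker process sets its own index; worker 0 takes on
    # chief-like duties such as saving checkpoints.
    "task": {"type": "worker", "index": 0},
})

# TFConfigClusterResolver reads the TF_CONFIG environment variable.
# MultiWorkerMirroredStrategy falls back to it by default, so passing
# it explicitly is optional.
resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)

# Model and optimizer must be created inside the strategy scope so
# their variables are mirrored across all workers.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```

Is the expectation that something on the Databricks side regenerates `TF_CONFIG` and restarts the workers whenever the cluster changes shape, or is there a resolver that can track membership dynamically?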