It seems like I have to configure a `cluster_resolver` before running training to enable distributed training across multiple workers.
But how does that work with autoscaling and node failures?
https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy
For reference, I am running this on Databricks.
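
For context, here is a minimal sketch of the static setup I have in mind, using `TFConfigClusterResolver` (the hostnames, port, and `TF_CONFIG` values are placeholders I made up). This is exactly the part that confuses me: the worker list is fixed up front, so I don't see how it stays valid when autoscaling adds or removes nodes.

```python
import json
import os

import tensorflow as tf

# Static two-worker cluster spec. With autoscaling, this hardcoded
# worker list seems like the problem: a node that joins or dies later
# isn't reflected here. (Hosts and port are placeholders.)
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker0.example.com:12345", "worker1.example.com:12345"],
    },
    # Each worker process sets its own index; worker 0 takes on
    # chief-like duties such as saving checkpoints.
    "task": {"type": "worker", "index": 0},
})

# TFConfigClusterResolver reads the TF_CONFIG environment variable.
# MultiWorkerMirroredStrategy falls back to it by default, so passing
# it explicitly is optional.
resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)

# Model and optimizer must be created inside the strategy scope so
# their variables are mirrored across all workers.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```

Is the expectation that something on the Databricks side regenerates `TF_CONFIG` and restarts the workers whenever the cluster changes shape, or is there a resolver that can track membership dynamically?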