Is it an anti-pattern to do multi-node Spot-enabled distributed GPU training on SageMaker?
I'm afraid that several issues will slow training down or even make it infeasible:
- the interruption detection lag
- the increased probability of interruption (N instances)
- the need to re-download data at every interruption
- the need to start/stop whole clusters instead of just replacing interrupted nodes
- the fact that SageMaker doesn't support variable-size clusters
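To put a rough number on the second point, here is a small sketch of how the chance of at least one interruption grows with cluster size. It assumes each instance is interrupted independently with the same probability during a job, which is a simplification (interruptions within one capacity pool are likely correlated). The 5% per-instance figure and the function name `p_any_interruption` are made up for illustration, not measured or taken from any AWS API.

```python
def p_any_interruption(p_single: float, n_instances: int) -> float:
    """Probability that at least one of n instances is interrupted.

    Assumes independent, identically likely interruptions (a simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if n_instances < 1:
        raise ValueError("n_instances must be >= 1")
    # Complement of "no instance is interrupted".
    return 1.0 - (1.0 - p_single) ** n_instances


if __name__ == "__main__":
    p = 0.05  # hypothetical per-instance interruption probability for one job
    for n in (1, 2, 4, 8, 16):
        print(f"{n:>2} instances: {p_any_interruption(p, n):.1%}")
```

Since one interrupted node stops the whole job, this cluster-level probability is the one that matters, not the per-instance one.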
Additionally, the EC2 Spot documentation discourages using Spot in multi-node workflows where nodes are tightly coupled, which is the case in data-parallel and model-parallel training: "Spot Instances are not suitable for workloads that are inflexible, stateful, fault-intolerant, or tightly coupled between instance nodes."
Does anybody here have experience running Spot-enabled distributed GPU training on SageMaker successfully?