Is SageMaker multi-node Spot-enabled GPU training an anti-pattern?

Question

Is it an anti-pattern to do multi-node Spot-enabled distributed GPU training on SageMaker?

I'm afraid that several issues will slow things down or even make them infeasible:

the interruption detection lag
the increased probability of interruption (N instances)
the need to re-download data at every interruption
the need to start/stop whole clusters instead of just replacing interrupted nodes
the fact that SageMaker doesn't support variable-size clusters
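
To put a rough number on the second concern, here is a minimal sketch. It assumes each instance is interrupted independently with the same probability during the job, which is a simplification (reclaims within one capacity pool are often correlated). The function name and the 5% figure are illustrative, not measured values.

```python
def job_interruption_probability(p_single: float, n_instances: int) -> float:
    """Probability that at least one of n instances is interrupted.

    Assumes independent interruptions with identical per-instance
    probability p_single (an illustrative simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_instances < 1:
        raise ValueError("n_instances must be at least 1")
    # The job survives only if every instance survives.
    return 1.0 - (1.0 - p_single) ** n_instances


if __name__ == "__main__":
    # 0.05 is a made-up per-instance probability for illustration only.
    for n in (1, 2, 4, 8, 16):
        print(n, round(job_interruption_probability(0.05, n), 3))
```

Since a tightly coupled job stops when any single node is lost, this is the probability that matters for the whole cluster, and under this assumption it grows with every node added.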

Additionally, the EC2 Spot documentation discourages using Spot in multi-node workflows where nodes are tightly coupled (which is the case in data-parallel and model-parallel training): "Spot Instances are not suitable for workloads that are inflexible, stateful, fault-intolerant, or tightly coupled between instance nodes."

Does anybody here have experience happily doing Spot-enabled distributed GPU training on SageMaker?

score 0 · Answer 1 · answered Oct 16 '22 at 12:27

The short answer is that Spot training works well when the instance type you need has enough free capacity in your region at the time you run the job. Otherwise you either won't be able to start the job, or you'll get interrupted too frequently.

Why not just try it for yourself? Once you have a working on-demand training job, you can enable Spot training by adding the 3 relevant parameters to the job's Estimator definition and implementing checkpoint save/load (good to have anyway). If it works well, great! If not, switch back.
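
As a rough sketch of what that change could look like: the answer does not name the three parameters, but in the SageMaker Python SDK the Spot-related Estimator arguments are `use_spot_instances`, `max_run`, and `max_wait`, with `checkpoint_s3_uri` telling SageMaker where to sync checkpoints, so those are presumably what is meant. The helper names, the `.pt` file extension, and the bucket in the usage comment are hypothetical.

```python
from pathlib import Path
from typing import Optional


def spot_training_kwargs(max_run: int, max_wait: int, checkpoint_s3_uri: str) -> dict:
    """Build the extra Estimator keyword arguments for managed Spot training.

    max_run: maximum training time in seconds.
    max_wait: maximum total time in seconds, including waiting for Spot
        capacity; SageMaker requires it to be at least max_run.
    """
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run,
        "max_wait": max_wait,
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


def latest_checkpoint(checkpoint_dir: str = "/opt/ml/checkpoints") -> Optional[Path]:
    """Return the newest checkpoint file in checkpoint_dir, or None.

    /opt/ml/checkpoints is SageMaker's default local checkpoint path;
    the "*.pt" pattern is an assumption about how the training script
    names its files.
    """
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    candidates = sorted(
        directory.glob("*.pt"),
        key=lambda path: (path.stat().st_mtime, path.name),
    )
    return candidates[-1] if candidates else None


# Hypothetical usage with an existing on-demand Estimator definition:
#
#   estimator = PyTorch(
#       ...,  # unchanged on-demand arguments
#       **spot_training_kwargs(
#           max_run=3600,
#           max_wait=7200,
#           checkpoint_s3_uri="s3://my-bucket/checkpoints/",
#       ),
#   )
```

In the training script, calling `latest_checkpoint()` at startup and loading the returned file if it is not `None` lets a restarted job resume instead of starting over. Switching back to on-demand only requires dropping the extra keyword arguments.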
