
Is it an anti-pattern to do multi-node Spot-enabled distributed GPU training on SageMaker?

I'm afraid that several issues will slow things down or even make this approach infeasible:

  • the interruption detection lag
  • the increased probability of interruption (which grows with the number of instances, N)
  • the need to re-download data at every interruption
  • the need to start/stop whole clusters instead of just replacing interrupted nodes
  • the fact that SageMaker doesn't support variable-size clusters

Additionally, the EC2 Spot documentation deters users from using Spot for multi-node workloads where nodes are tightly coupled (which is the case in both data-parallel and model-parallel training): "Spot Instances are not suitable for workloads that are inflexible, stateful, fault-intolerant, or tightly coupled between instance nodes."

Does anybody here have experience happily doing Spot-enabled distributed GPU training on SageMaker?

juvchan

1 Answer


The short answer is that Spot training works well when the instance type you need, in the region you need, has enough free capacity at that particular time. Otherwise you won't be able to start the job, or you'll get interruptions too frequently.

Why not just try it for yourself? Once you have a working on-demand training job, you can enable Spot training by adding three relevant parameters to the job's Estimator definition and implementing checkpoint save/load (good to have anyway). Then if it works well, great! If not, switch back.
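For reference, here's a minimal sketch of what that can look like with the SageMaker Python SDK, assuming a PyTorch estimator; the role ARN, entry point, instance type, framework versions, and S3 URIs below are placeholders, not values from the question:

```python
# Sketch: enabling managed Spot training on an existing SageMaker Estimator.
# All concrete values (role, script, bucket, instance type) are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder role ARN
    instance_count=2,                                      # multi-node distributed job
    instance_type="ml.p3.16xlarge",                        # example GPU instance type
    framework_version="1.13",
    py_version="py39",
    # Spot-related settings:
    use_spot_instances=True,        # request Spot capacity instead of on-demand
    max_run=24 * 3600,              # max training time in seconds
    max_wait=36 * 3600,             # must be >= max_run; includes time spent waiting for Spot
    # Checkpoints written to the local path are synced to S3, so an
    # interrupted job can resume instead of restarting from scratch.
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",       # placeholder bucket
    checkpoint_local_path="/opt/ml/checkpoints",
)

estimator.fit({"train": "s3://my-bucket/train/"})          # placeholder dataset URI
```

And a sketch of the matching checkpoint save/resume logic inside the (hypothetical) train.py, assuming PyTorch; SageMaker syncs checkpoint_local_path with checkpoint_s3_uri, so a restarted job finds the last checkpoint locally:

```python
# Sketch: save/resume so a restarted Spot job continues from the last checkpoint.
import os
import torch

CKPT_DIR = "/opt/ml/checkpoints"
CKPT_PATH = os.path.join(CKPT_DIR, "latest.pt")

def save_checkpoint(model, optimizer, epoch):
    # Called periodically during training; the file is synced to checkpoint_s3_uri.
    os.makedirs(CKPT_DIR, exist_ok=True)
    torch.save(
        {"model": model.state_dict(), "optim": optimizer.state_dict(), "epoch": epoch},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    # Resume if an earlier (possibly interrupted) run left a checkpoint behind.
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optim"])
        return state["epoch"] + 1
    return 0
```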

Gili Nachum