I have a requirement to use N 1x GPU Spot instances instead of 1x N-GPU instance for distributed training.
Does SageMaker Distributed Training support the use of GPU Spot instance(s)? If yes, how to enable it?
Yes, Amazon SageMaker distributed training supports Spot Instances. You enable it the same way as for a regular training job: add the parameters below to your estimator and then call the fit method. Note that max_wait must be greater than or equal to max_run, since it includes time spent waiting for Spot capacity.

use_spot_instances=True,
max_wait=<x_in_seconds>,
max_run=<x_in_seconds>
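As a minimal sketch of how those parameters fit into an estimator, here is an example using the PyTorch framework estimator. The script name, IAM role, S3 URIs, and instance type are placeholders/assumptions; adapt them to your setup. The checkpoint_s3_uri parameter is strongly recommended with Spot so training can resume after an interruption.

```python
# Sketch only: assumes a training script "train.py" and placeholder
# role/bucket values; swap in your own framework estimator as needed.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # your training script (placeholder)
    role="<your-sagemaker-role-arn>",  # IAM execution role (placeholder)
    framework_version="2.2",
    py_version="py310",
    instance_count=4,                  # N nodes
    instance_type="ml.p3.2xlarge",     # 1 GPU per node (example type)
    # Distribution config is an assumption; choose the launcher that
    # matches your training code (e.g. torch_distributed / DDP).
    distribution={"torch_distributed": {"enabled": True}},
    # Spot settings:
    use_spot_instances=True,
    max_run=3600,                      # max training time, in seconds
    max_wait=7200,                     # must be >= max_run; includes Spot wait time
    # Checkpointing lets the job resume after a Spot interruption:
    checkpoint_s3_uri="s3://<your-bucket>/checkpoints/",
)

estimator.fit("s3://<your-bucket>/training-data/")
```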
For your scenario, however, scaling out to N single-GPU nodes is generally not beneficial, because inter-node GPU communication adds overhead that intra-node communication does not. The recommendation is to scale vertically (use multi-GPU instances) before scaling horizontally.