Questions tagged [amz-sagemaker-distributed-training]
19 questions
1
vote
1 answer
Distributed training on PyTorch and Spot checkpoints in SageMaker
I'm building a custom model on PyTorch and want to know how to implement snapshot logic for distributed training.
If a model is trained on multiple Spot Instances and the model is implemented on a BYO PyTorch image, how does SageMaker know which…

juvchan
- 6,113
- 2
- 22
- 35
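For context, a minimal sketch of the Spot checkpointing setup with the SageMaker Python SDK; the script name, role, and bucket below are placeholders. SageMaker syncs each instance's /opt/ml/checkpoints directory to checkpoint_s3_uri and restores it after an interruption:

```python
# Sketch only: Spot training with checkpointing on a PyTorch estimator.
# entry_point, role, and bucket are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # hypothetical training script
    role="<execution-role-arn>",
    framework_version="1.8.1",
    py_version="py36",
    instance_count=2,                  # multi-instance distributed job
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,
    max_run=3600,
    max_wait=7200,                     # must be >= max_run for Spot jobs
    # Each instance's /opt/ml/checkpoints is synced here and restored
    # on all nodes after a Spot interruption.
    checkpoint_s3_uri="s3://<bucket>/checkpoints/",
)
```

In the training script itself, a common convention is to write checkpoints from rank 0 only and have every rank load them on restart.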
1
vote
1 answer
Add Security groups in Amazon SageMaker for distributed training jobs
We would like to enforce specific security groups to be set on the SageMaker training jobs (XGBoost in script mode).
However, distributed training, in this case, won’t work out of the box, since the containers need to communicate with each other.…

Philipp Schmid
- 126
- 7
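For reference, the VPC settings are passed directly on the estimator; a sketch with placeholder IDs, assuming the security group carries a self-referencing inbound rule so the training containers can reach each other:

```python
# Sketch: attaching VPC networking to a training job. IDs are placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<xgboost-image-uri>",
    role="<execution-role-arn>",
    instance_count=2,
    instance_type="ml.m5.xlarge",
    # The security group needs an inbound rule allowing all traffic from
    # itself; without it, inter-container communication (and therefore
    # distributed training) fails.
    security_group_ids=["sg-<placeholder>"],
    subnets=["subnet-<placeholder>"],
)
```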
1
vote
1 answer
Amazon SageMaker multi GPU: No objective found
I have a question on SageMaker multi-GPU. I have a customer running their code on single-GPU instances (ml.p3.2xlarge), but when they select ml.p3.8xlarge (multi-GPU), it runs into the following error:
“Failure reason: No objective metrics found after…

Philipp Schmid
- 126
- 7
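One frequently cited cause, sketched below: on multi-GPU/MPI jobs the log lines gain a rank prefix such as [1,mpirank:0,algo-1], so an objective-metric regex anchored to the start of the line stops matching. An unanchored pattern (the metric name and regex here are assumptions) avoids that:

```python
# Sketch: a prefix-tolerant metric definition for a tuning job.
metric_definitions = [
    # Matches anywhere in the line, so the MPI rank prefix is harmless.
    {"Name": "validation:loss", "Regex": r"val_loss=([0-9\.]+)"},
]
# Passed via e.g. PyTorch(..., metric_definitions=metric_definitions)
```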
0
votes
0 answers
Why do people still bother using distributed computing products like AnyScale and AWS SageMaker when EC2 can provide a super-large instance?
Let's say someone wants to train a neural network model on 50 GB of data; they can just use an AWS EC2 instance with a large number of CPUs and a large amount of memory. The largest AWS EC2 instance provides 448 vCPUs and more than 12 TB of memory, which would do some…

Owen
- 19
- 4
0
votes
1 answer
How can I save a model from a Sagemaker Pipelines TrainingStep in a specific location i.e. without the unique parent folder?
I know that the TrainingStep saves the model as output by default, but I want to save it in a specific place in my S3 bucket. I need a way to programmatically find where a model is stored, so I want to get rid of the unique parent directory…

Progress
- 117
- 1
- 9
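For the "find it programmatically" part, the artifact URI can be read off the step's properties at pipeline runtime rather than hard-coded; a sketch (the estimator is assumed to be defined elsewhere):

```python
# Sketch: resolving the model artifact location from a TrainingStep.
from sagemaker.workflow.steps import TrainingStep

step_train = TrainingStep(name="TrainModel", estimator=estimator)

# Mirrors the DescribeTrainingJob response shape; resolves to
# <output_path>/<job-name>/output/model.tar.gz at execution time.
model_uri = step_train.properties.ModelArtifacts.S3ModelArtifacts
```

The estimator's output_path controls the S3 prefix, but SageMaker still appends the unique job-name folder; referencing the property sidesteps hard-coding it.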
0
votes
1 answer
How to train a SageMaker job with data coming from FSx for Lustre
I am trying to implement the following example:
https://medium.com/@sayons/transfer-learning-with-amazon-sagemaker-and-fsx-for-lustre-378fa8977cc1
but I am getting the following error:
UnexpectedStatusException: Error for Training job…

sebtac
- 538
- 5
- 8
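For reference, the FSx wiring in the SDK looks roughly like the sketch below (IDs and paths are placeholders); a mismatch between the job's VPC settings and the file system's subnet or security group is a common source of this kind of failure:

```python
# Sketch: feeding an FSx for Lustre file system to a training job.
from sagemaker.inputs import FileSystemInput

fsx_input = FileSystemInput(
    file_system_id="fs-<placeholder>",
    file_system_type="FSxLustre",
    directory_path="/<mount-name>/training-data",  # placeholder path
    file_system_access_mode="ro",
)
# The estimator must also run in the file system's VPC, via
# subnets=[...] and security_group_ids=[...].
estimator.fit({"train": fsx_input})
```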
0
votes
0 answers
How can I implement auto-scaling for a SageMaker PySpark processing job?
I am attempting to set up a SageMaker PySpark processing job that can scale workers in or out automatically, based on the required processing power. Unfortunately, this functionality is not currently available in SageMaker. However, I have…
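Since the service offers no in-job autoscaling, the usual workaround is to size the cluster per run; a sketch with placeholder values:

```python
# Sketch: a fixed-size Spark processing cluster, parameterized per run.
from sagemaker.spark.processing import PySparkProcessor

processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="3.1",
    role="<execution-role-arn>",
    instance_count=4,              # chosen up front; no autoscaling mid-job
    instance_type="ml.m5.xlarge",
)
processor.run(submit_app="preprocess.py")  # hypothetical PySpark script
```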
0
votes
0 answers
PyTorch Lightning not using all resources
I am running the lab 1 example as-is. Everything goes fine and training succeeds. But when I check the training logs, it is all happening on [1,mpirank:0,algo-1]. I am passing instance_count as two and can see there are two hosts [algo-1 and…

souraj
- 13
- 2
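A detail worth checking, sketched below assuming PyTorch Lightning 1.6+: the Trainer has to be told about every GPU and node explicitly, or work stays on a single rank:

```python
# Sketch: making Lightning use all GPUs on all hosts.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,        # GPUs per instance (assumption: an 8-GPU instance type)
    num_nodes=2,      # must match instance_count on the estimator
    strategy="ddp",
)
```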
0
votes
1 answer
How to properly use ShardedByS3Key in distributed training scenario?
Following the API reference, one way to optimize data ingestion for distributed training is using ShardedByS3Key.
Are there code samples for using ShardedByS3Key in the context of distributed training? Concretely, what changes to, e.g.,…

Philipp Schmid
- 126
- 7
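For reference, the change is confined to the input channel; a sketch with a placeholder prefix:

```python
# Sketch: sharding the input data across training instances by S3 key.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://<bucket>/train/",
    distribution="ShardedByS3Key",  # each instance receives a disjoint shard
)
estimator.fit({"train": train_input})
```

Each instance then sees only its shard, so the training script must not assume it has the full dataset.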
0
votes
1 answer
Is SageMaker multi-node Spot-enabled GPU training an anti-pattern?
Is it an anti-pattern to do multi-node Spot-enabled distributed GPU training on SageMaker?
I'm afraid that several issues will slow things down or even make them infeasible:
the interruption detection lag
the increased probability of interruption…

juvchan
- 6,113
- 2
- 22
- 35
0
votes
0 answers
Distributed Spark on Amazon SageMaker
I have built a SparkML collaborative filtering algorithm that I want to train and deploy on SageMaker. What is the best way to achieve this other than BYOC?
Also, I want to understand how distributed training works in SageMaker if we go with the…

Philipp Schmid
- 126
- 7
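The Spark-side code itself would be ordinary pyspark.ml; a sketch of collaborative filtering with ALS (column names and paths are placeholders) that a Spark job on SageMaker could submit:

```python
# Sketch: ALS collaborative filtering inside a Spark job.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("cf-train").getOrCreate()
ratings = spark.read.parquet("s3://<bucket>/ratings/")    # placeholder input
als = ALS(userCol="user_id", itemCol="item_id", ratingCol="rating")
model = als.fit(ratings)                  # distributed across Spark executors
model.write().overwrite().save("s3://<bucket>/als-model/")
```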
0
votes
1 answer
Distributed Unsupervised Learning in SageMaker
I am running unsupervised learning (predominantly clustering) locally on a single large node with a GPU.
Does SageMaker support distributed unsupervised learning using clustering?
If yes, please provide the relevant example (preferably non-TensorFlow).

juvchan
- 6,113
- 2
- 22
- 35
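One non-TensorFlow option is SageMaker's built-in k-means algorithm, which distributes simply by raising instance_count; a sketch (train_data is assumed to be a float32 NumPy matrix defined elsewhere):

```python
# Sketch: distributed clustering with the built-in k-means algorithm.
from sagemaker import KMeans

kmeans = KMeans(
    role="<execution-role-arn>",
    instance_count=2,              # distributed across two instances
    instance_type="ml.c5.xlarge",
    k=10,
)
kmeans.fit(kmeans.record_set(train_data))  # train_data: np.float32 matrix
```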
0
votes
2 answers
Why does PyTorch DDP init time out on SageMaker?
I'm using PyTorch DDP on the SageMaker PyTorch Training DLC 1.8.1. The code seems properly DDP-formatted. I'm using instance_count = 2 and launching via torch.distributed.launch, and I believe the ranks and world size are properly set; however…

Philipp Schmid
- 126
- 7
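Two knobs that often come up when debugging this, sketched below (the interface name and timeout value are assumptions, not verified defaults):

```python
# Sketch: a more forgiving process-group init for diagnosing timeouts.
import datetime
import os

import torch.distributed as dist

# Pin NCCL to the container's network interface (assumption: eth0).
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

dist.init_process_group(
    backend="nccl",
    init_method="env://",                    # ranks/world size from env vars
    timeout=datetime.timedelta(minutes=30),  # raised while diagnosing
)
```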
0
votes
1 answer
Distributed training example for Temporal Fusion Transformer in SageMaker
We’re training a big Temporal Fusion Transformer using PyTorch.
We’re looking into using distributed training with SageMaker to accelerate training jobs.
Does anyone have any examples of this? Any pattern you can recommend?

Philipp Schmid
- 126
- 7
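One common pattern, sketched here, is a stock PyTorch estimator with the SageMaker data parallel library switched on via the distribution argument (the script name and versions are placeholders):

```python
# Sketch: data-parallel distributed training for a PyTorch model.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_tft.py",       # hypothetical TFT training script
    role="<execution-role-arn>",
    framework_version="1.10",
    py_version="py38",
    instance_count=2,
    instance_type="ml.p3.16xlarge",   # one of the SMDDP-supported types
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
```

The instance-type restriction this implies is exactly what the next question asks about.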
0
votes
1 answer
Why does SageMaker Distributed Data Parallel training only support 3 instance types?
I see here that the SageMaker Distributed Data Parallel library only supports 3 instance types: ml.p3.16xlarge, ml.p3dn.24xlarge, ml.p4d.24xlarge.
Why is this? I would have thought there might be use cases for parallel training for other GPUs, and even…

Philipp Schmid
- 126
- 7