Questions tagged [amz-sagemaker-distributed-training]
19 questions
1
vote
1 answer
Distributed training on PyTorch and Spot checkpoints in SageMaker
I'm building a custom model on PyTorch and want to know how to implement snapshot logic for distributed training.
If a model is trained on multiple Spot Instances and the model is implemented on a BYO PyTorch image, how does SageMaker know which…

juvchan
- 6,113
- 2
- 22
- 35
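For context, a minimal sketch of the Spot checkpointing setup with the SageMaker Python SDK; the script name, role, and bucket below are placeholders. SageMaker syncs each instance's /opt/ml/checkpoints directory to checkpoint_s3_uri and restores it after an interruption:

```python
# Sketch only: Spot training with checkpointing on a PyTorch estimator.
# entry_point, role, and bucket are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # hypothetical training script
    role="<execution-role-arn>",
    framework_version="1.8.1",
    py_version="py36",
    instance_count=2,                  # multi-instance distributed job
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,
    max_run=3600,
    max_wait=7200,                     # must be >= max_run for Spot jobs
    # Each instance's /opt/ml/checkpoints is synced here and restored
    # on all nodes after a Spot interruption.
    checkpoint_s3_uri="s3://<bucket>/checkpoints/",
)
```

In the training script itself, a common convention is to write checkpoints from rank 0 only and have every rank load them on restart.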
1
vote
1 answer
Add Security groups in Amazon SageMaker for distributed training jobs
We would like to enforce specific security groups to be set on the SageMaker training jobs (XGBoost in script mode).
However, distributed training, in this case, won’t work out of the box, since the containers need to communicate with each other.…

Philipp Schmid
- 126
- 7
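For reference, the VPC settings are passed directly on the estimator; a sketch with placeholder IDs, assuming the security group carries a self-referencing inbound rule so the training containers can reach each other:

```python
# Sketch: attaching VPC networking to a training job. IDs are placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<xgboost-image-uri>",
    role="<execution-role-arn>",
    instance_count=2,
    instance_type="ml.m5.xlarge",
    # The security group needs an inbound rule allowing all traffic from
    # itself; without it, inter-container communication (and therefore
    # distributed training) fails.
    security_group_ids=["sg-<placeholder>"],
    subnets=["subnet-<placeholder>"],
)
```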
1
vote
1 answer
Amazon SageMaker multi GPU: No objective found
I have a question on SageMaker multi-GPU. I have a customer running their code on single-GPU instances (ml.p3.2xlarge), but when they select ml.p3.8xlarge (multi-GPU), it runs into the following error:
“Failure reason: No objective metrics found after…

Philipp Schmid
- 126
- 7
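One frequently cited cause, sketched below: on multi-GPU/MPI jobs the log lines gain a rank prefix such as [1,mpirank:0,algo-1], so an objective-metric regex anchored to the start of the line stops matching. An unanchored pattern (the metric name and regex here are assumptions) avoids that:

```python
# Sketch: a prefix-tolerant metric definition for a tuning job.
metric_definitions = [
    # Matches anywhere in the line, so the MPI rank prefix is harmless.
    {"Name": "validation:loss", "Regex": r"val_loss=([0-9\.]+)"},
]
# Passed via e.g. PyTorch(..., metric_definitions=metric_definitions)
```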
0
votes
0 answers
Why do people still bother using distributed computing products like AnyScale and AWS SageMaker when EC2 can provide a super-large instance?
Let's say someone wants to train a neural network model on 50 GB of data; they can just use an AWS EC2 instance with a large number of CPUs and a large amount of memory. The largest AWS EC2 instance provides 448 vCPUs and more than 12 TB of memory, which would do some…

Owen
- 19
- 4
0
votes
1 answer
How can I save a model from a Sagemaker Pipelines TrainingStep in a specific location i.e. without the unique parent folder?
I know that the TrainingStep saves the model as output by default, but I want to save it in a specific place in my S3 bucket. I need a way to programmatically find where a model is stored, so I want to get rid of the unique parent directory…

Progress
- 117
- 1
- 9
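For the "find it programmatically" part, the artifact URI can be read off the step's properties at pipeline runtime rather than hard-coded; a sketch (the estimator is assumed to be defined elsewhere):

```python
# Sketch: resolving the model artifact location from a TrainingStep.
from sagemaker.workflow.steps import TrainingStep

step_train = TrainingStep(name="TrainModel", estimator=estimator)

# Mirrors the DescribeTrainingJob response shape; resolves to
# <output_path>/<job-name>/output/model.tar.gz at execution time.
model_uri = step_train.properties.ModelArtifacts.S3ModelArtifacts
```

The estimator's output_path controls the S3 prefix, but SageMaker still appends the unique job-name folder; referencing the property sidesteps hard-coding it.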
0
votes
1 answer
How to train a SageMaker job with data coming from FSx for Lustre
I am trying to implement the following example:
https://medium.com/@sayons/transfer-learning-with-amazon-sagemaker-and-fsx-for-lustre-378fa8977cc1
but I am getting the following error:
UnexpectedStatusException: Error for Training job…

sebtac
- 538
- 5
- 8
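For reference, the FSx wiring in the SDK looks roughly like the sketch below (IDs and paths are placeholders); a mismatch between the job's VPC settings and the file system's subnet or security group is a common source of this kind of failure:

```python
# Sketch: feeding an FSx for Lustre file system to a training job.
from sagemaker.inputs import FileSystemInput

fsx_input = FileSystemInput(
    file_system_id="fs-<placeholder>",
    file_system_type="FSxLustre",
    directory_path="/<mount-name>/training-data",  # placeholder path
    file_system_access_mode="ro",
)
# The estimator must also run in the file system's VPC, via
# subnets=[...] and security_group_ids=[...].
estimator.fit({"train": fsx_input})
```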
0
votes
0 answers
How can I implement auto-scaling for a SageMaker PySpark processing job?
I am attempting to set up a SageMaker PySpark processing job that can scale workers in or out automatically, based on the required processing power. Unfortunately, this functionality is not currently available in SageMaker. However, I have…
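Since the service offers no in-job autoscaling, the usual workaround is to size the cluster per run; a sketch with placeholder values:

```python
# Sketch: a fixed-size Spark processing cluster, parameterized per run.
from sagemaker.spark.processing import PySparkProcessor

processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="3.1",
    role="<execution-role-arn>",
    instance_count=4,              # chosen up front; no autoscaling mid-job
    instance_type="ml.m5.xlarge",
)
processor.run(submit_app="preprocess.py")  # hypothetical PySpark script
```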
0
votes
0 answers
PyTorch Lightning not using all resources
I am running the lab 1 example as-is. Everything goes fine and training succeeds. But when I check the training logs, it is all happening on [1,mpirank:0,algo-1]. I am passing instance_count as two and can see there are two hosts [algo-1 and…

souraj
- 13
- 2
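A detail worth checking, sketched below assuming PyTorch Lightning 1.6+: the Trainer has to be told about every GPU and node explicitly, or work stays on a single rank:

```python
# Sketch: making Lightning use all GPUs on all hosts.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,        # GPUs per instance (assumption: an 8-GPU instance type)
    num_nodes=2,      # must match instance_count on the estimator
    strategy="ddp",
)
```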
0
votes
1 answer
How to properly use ShardedByS3Key in distributed training scenario?
Following the API reference, one way to optimize data ingestion for distributed training is using ShardedByS3Key.
Are there code samples for using ShardedByS3Key in the context of distributed training? Concretely, what changes to, e.g.,…

Philipp Schmid
- 126
- 7
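For reference, the change is confined to the input channel; a sketch with a placeholder prefix:

```python
# Sketch: sharding the input data across training instances by S3 key.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://<bucket>/train/",
    distribution="ShardedByS3Key",  # each instance receives a disjoint shard
)
estimator.fit({"train": train_input})
```

Each instance then sees only its shard, so the training script must not assume it has the full dataset.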
0
votes
1 answer
Is SageMaker multi-node Spot-enabled GPU training an anti-pattern?
Is it an anti-pattern to do multi-node Spot-enabled distributed GPU training on SageMaker?
I'm afraid that several issues will slow things down or even make them infeasible:
the interruption detection lag
the increased probability of interruption…

juvchan
- 6,113
- 2
- 22
- 35
0
votes
0 answers
Distributed Spark on Amazon SageMaker
I have built a SparkML collaborative filtering algorithm that I want to train and deploy on SageMaker. What is the best way to achieve this other than BYOC?
Also, I want to understand how distributed training works in SageMaker if we go with the…

Philipp Schmid
- 126
- 7
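The Spark-side code itself would be ordinary pyspark.ml; a sketch of collaborative filtering with ALS (column names and paths are placeholders) that a Spark job on SageMaker could submit:

```python
# Sketch: ALS collaborative filtering inside a Spark job.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("cf-train").getOrCreate()
ratings = spark.read.parquet("s3://<bucket>/ratings/")    # placeholder input
als = ALS(userCol="user_id", itemCol="item_id", ratingCol="rating")
model = als.fit(ratings)                  # distributed across Spark executors
model.write().overwrite().save("s3://<bucket>/als-model/")
```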
0
votes
1 answer
Distributed Unsupervised Learning in SageMaker
I am running unsupervised learning (predominantly clustering) locally on a single large node with a GPU.
Does SageMaker support distributed unsupervised learning using clustering?
If yes, please provide the relevant example (preferably non-TensorFlow).

juvchan
- 6,113
- 2
- 22
- 35
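One non-TensorFlow option is SageMaker's built-in k-means algorithm, which distributes simply by raising instance_count; a sketch (train_data is assumed to be a float32 NumPy matrix defined elsewhere):

```python
# Sketch: distributed clustering with the built-in k-means algorithm.
from sagemaker import KMeans

kmeans = KMeans(
    role="<execution-role-arn>",
    instance_count=2,              # distributed across two instances
    instance_type="ml.c5.xlarge",
    k=10,
)
kmeans.fit(kmeans.record_set(train_data))  # train_data: np.float32 matrix
```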
0
votes
2 answers
Why does PyTorch DDP init time out on SageMaker?
I'm using PyTorch DDP on the SageMaker PyTorch Training DLC 1.8.1. The code seems properly DDP-formatted. I'm using instance_count = 2 and launching via torch.distributed.launch, and I believe the ranks and world size are properly set; however…

Philipp Schmid
- 126
- 7
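Two knobs that often come up when debugging this, sketched below (the interface name and timeout value are assumptions, not verified defaults):

```python
# Sketch: a more forgiving process-group init for diagnosing timeouts.
import datetime
import os

import torch.distributed as dist

# Pin NCCL to the container's network interface (assumption: eth0).
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

dist.init_process_group(
    backend="nccl",
    init_method="env://",                    # ranks/world size from env vars
    timeout=datetime.timedelta(minutes=30),  # raised while diagnosing
)
```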
0
votes
1 answer
Distributed training example for Temporal Fusion Transformer in SageMaker
We’re training a big Temporal Fusion Transformer using PyTorch.
We’re looking into using distributed training with SageMaker to accelerate training jobs.
Does anyone have any examples of this? Any pattern you can recommend?

Philipp Schmid
- 126
- 7
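One common pattern, sketched here, is a stock PyTorch estimator with the SageMaker data parallel library switched on via the distribution argument (the script name and versions are placeholders):

```python
# Sketch: data-parallel distributed training for a PyTorch model.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_tft.py",       # hypothetical TFT training script
    role="<execution-role-arn>",
    framework_version="1.10",
    py_version="py38",
    instance_count=2,
    instance_type="ml.p3.16xlarge",   # one of the SMDDP-supported types
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
```

The instance-type restriction this implies is exactly what the next question asks about.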
0
votes
1 answer
Why does SageMaker Distributed Data Parallel training only support 3 instance types?
I see here that the SageMaker Distributed Data Parallel library only supports 3 instance types: ml.p3.16xlarge, ml.p3dn.24xlarge, ml.p4d.24xlarge.
Why is this? I would have thought there might be use cases for parallel training for other GPUs, and even…

Philipp Schmid
- 126
- 7