Questions tagged [amz-sagemaker-distributed-training]
19 questions
0 votes, 1 answer
Is SageMaker Distributed Data-Parallel (SMDDP) supported for keras models?
The documentation says: "SageMaker distributed data parallel is adaptable to TensorFlow training scripts composed of tf core modules except tf.keras modules. SageMaker…

Philipp Schmid
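The pattern the AWS docs describe for this case is a custom training loop built from tf core ops instead of Keras `model.fit()`. Below is a minimal sketch of that pattern, assuming an SMDDP-enabled TensorFlow training container; the toy linear model, shapes, and learning rate are placeholders, not from the question:

```python
import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp

sdp.init()

# Pin each worker process to its own GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")

# A toy linear model built from tf core ops only (no tf.keras).
W = tf.Variable(tf.random.normal([784, 10]), name="weights")
b = tf.Variable(tf.zeros([10]), name="bias")
lr = 0.001 * sdp.size()  # scale the learning rate by world size

@tf.function
def train_step(x, y, first_batch):
    with tf.GradientTape() as tape:
        logits = tf.matmul(x, W) + b
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
    # SMDDP wraps the tape so gradients are all-reduced across workers.
    tape = sdp.DistributedGradientTape(tape)
    grads = tape.gradient(loss, [W, b])
    for var, grad in zip([W, b], grads):
        var.assign_sub(lr * grad)  # plain SGD update, tf core only
    if first_batch:
        # Start all workers from rank 0's initial weights.
        sdp.broadcast_variables([W, b], root_rank=0)
    return loss
```

The key SMDDP pieces are `sdp.init()`, wrapping the tape in `sdp.DistributedGradientTape`, and broadcasting rank 0's initial variables so all workers start in sync.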
0 votes, 0 answers
GPU 0 utilization higher than other GPUs on Amazon SageMaker SMDP (distributed training)
When using SageMaker Data Parallelism (SMDP), my team sees higher utilization on GPU 0 than on the other GPUs. What is the likely cause here? Does it have anything to do with the data loader workers that run on the CPU? I would expect SMDP to…

Philipp Schmid
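One frequent cause of this pattern (offered as a guess, since the question is truncated): any process that allocates CUDA memory or initializes collectives before calling `torch.cuda.set_device()` defaults to `cuda:0`, so every rank adds load to GPU 0. Below is a minimal sketch of per-rank device pinning with the SMDDP PyTorch backend, assuming the launcher exports `LOCAL_RANK` as in the AWS examples; the linear model and the commented loader loop are placeholders:

```python
import os
import torch
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend

dist.init_process_group(backend="smddp")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)  # pin this process to its own GPU
device = torch.device("cuda", local_rank)

model = torch.nn.Linear(128, 10).to(device)  # placeholder model
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# Move batches to the local device, never implicitly to cuda:0:
# for x, y in loader:
#     x = x.to(device, non_blocking=True)
#     y = y.to(device, non_blocking=True)
```

Data loader workers run on the CPU, so they would show up as CPU load rather than GPU 0 load; device pinning and `.to(device)` placement are the first things to check.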
0 votes, 1 answer
Use PyTorch DistributedDataParallel with Hugging Face on Amazon SageMaker
Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP), because DP's strategy is less performant and uses more memory on the default device. (Per this PyTorch forums…

Philipp Schmid
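For launching such a script on SageMaker, recent versions of the sagemaker SDK expose a `pytorchddp` distribution option on the Hugging Face estimator, which runs the entry point under `torch.distributed` with one process per GPU. A minimal sketch, assuming a recent SDK; the role ARN, script name, S3 path, and container versions are placeholders:

```python
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",           # placeholder script that uses DDP internally
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_type="ml.p3.16xlarge",   # 8 GPUs in a single instance
    instance_count=1,                 # DDP still helps on one multi-GPU node
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    distribution={"pytorchddp": {"enabled": True}},
)
estimator.fit({"train": "s3://my-bucket/train"})  # placeholder S3 path
```

Even with `instance_count=1`, this starts one DDP worker per GPU on the instance, which matches the single-instance recommendation the question cites.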
0 votes, 1 answer
Create Hugging Face Transformers Tokenizer using Amazon SageMaker in a distributed way
I am using the SageMaker HuggingFace Processor to create a custom tokenizer on a large volume of text data.
Is there a way to make this job data-distributed, i.e., read partitions of the data across nodes and train the tokenizer leveraging multiple…

Philipp Schmid
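Partitioning the input across processing instances is possible with `s3_data_distribution_type="ShardedByS3Key"` on the `ProcessingInput`, which gives each node a disjoint subset of the S3 objects. A minimal sketch with placeholder paths, role, and versions; note this only distributes the reads, and merging per-node tokenizer statistics into a single tokenizer would still need an aggregation step inside the script:

```python
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = HuggingFaceProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_type="ml.c5.4xlarge",
    instance_count=4,                 # each instance gets ~1/4 of the objects
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)
processor.run(
    code="train_tokenizer.py",        # placeholder per-node tokenizer script
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/corpus/",             # placeholder corpus path
            destination="/opt/ml/processing/input",
            s3_data_distribution_type="ShardedByS3Key",  # partition objects by key
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://my-bucket/tokenizer/",     # placeholder output path
        )
    ],
)
```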