Questions tagged [amz-sagemaker-distributed-training]
19 questions
0 votes, 1 answer
Is SageMaker Distributed Data-Parallel (SMDDP) supported for keras models?
The documentation says: "SageMaker distributed data parallel is adaptable to TensorFlow training scripts composed of tf core modules except tf.keras modules. SageMaker…

Philipp Schmid
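The pattern the AWS docs describe for this case is a custom training loop built from tf core ops instead of Keras `model.fit()`. Below is a minimal sketch of that pattern, assuming an SMDDP-enabled TensorFlow training container; the toy linear model, shapes, and learning rate are placeholders, not from the question:

```python
import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp

sdp.init()

# Pin each worker process to its own GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")

# A toy linear model built from tf core ops only (no tf.keras).
W = tf.Variable(tf.random.normal([784, 10]), name="weights")
b = tf.Variable(tf.zeros([10]), name="bias")
lr = 0.001 * sdp.size()  # scale the learning rate by world size

@tf.function
def train_step(x, y, first_batch):
    with tf.GradientTape() as tape:
        logits = tf.matmul(x, W) + b
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
    # SMDDP wraps the tape so gradients are all-reduced across workers.
    tape = sdp.DistributedGradientTape(tape)
    grads = tape.gradient(loss, [W, b])
    for var, grad in zip([W, b], grads):
        var.assign_sub(lr * grad)  # plain SGD update, tf core only
    if first_batch:
        # Start all workers from rank 0's initial weights.
        sdp.broadcast_variables([W, b], root_rank=0)
    return loss
```

The key SMDDP pieces are `sdp.init()`, wrapping the tape in `sdp.DistributedGradientTape`, and broadcasting rank 0's initial variables so all workers start in sync.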
0 votes, 0 answers
GPU 0 utilization higher than other GPUs on Amazon SageMaker SMDP (distributed training)
When using SageMaker Data Parallelism (SMDP), my team sees higher utilization on GPU 0 than on the other GPUs. What is the likely cause here? Does it have anything to do with the data loader workers that run on the CPU? I would expect SMDP to…

Philipp Schmid
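One frequent cause of this pattern (offered as a guess, since the question is truncated): any process that allocates CUDA memory or initializes collectives before calling `torch.cuda.set_device()` defaults to `cuda:0`, so every rank adds load to GPU 0. Below is a minimal sketch of per-rank device pinning with the SMDDP PyTorch backend, assuming the launcher exports `LOCAL_RANK` as in the AWS examples; the linear model and the commented loader loop are placeholders:

```python
import os
import torch
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend

dist.init_process_group(backend="smddp")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)  # pin this process to its own GPU
device = torch.device("cuda", local_rank)

model = torch.nn.Linear(128, 10).to(device)  # placeholder model
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# Move batches to the local device, never implicitly to cuda:0:
# for x, y in loader:
#     x = x.to(device, non_blocking=True)
#     y = y.to(device, non_blocking=True)
```

Data loader workers run on the CPU, so they would show up as CPU load rather than GPU 0 load; device pinning and `.to(device)` placement are the first things to check.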
0 votes, 1 answer
Use PyTorch DistributedDataParallel with Hugging Face on Amazon SageMaker
Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP), because DP's strategy is less performant and uses more memory on the default device. (Per this PyTorch forums…

Philipp Schmid
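For launching such a script on SageMaker, recent versions of the sagemaker SDK expose a `pytorchddp` distribution option on the Hugging Face estimator, which runs the entry point under `torch.distributed` with one process per GPU. A minimal sketch, assuming a recent SDK; the role ARN, script name, S3 path, and container versions are placeholders:

```python
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",           # placeholder script that uses DDP internally
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_type="ml.p3.16xlarge",   # 8 GPUs in a single instance
    instance_count=1,                 # DDP still helps on one multi-GPU node
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    distribution={"pytorchddp": {"enabled": True}},
)
estimator.fit({"train": "s3://my-bucket/train"})  # placeholder S3 path
```

Even with `instance_count=1`, this starts one DDP worker per GPU on the instance, which matches the single-instance recommendation the question cites.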
0 votes, 1 answer
Create Hugging Face Transformers Tokenizer using Amazon SageMaker in a distributed way
I am using the SageMaker HuggingFace Processor to create a custom tokenizer on a large volume of text data.
Is there a way to make this job data-distributed, i.e., read partitions of the data across nodes and train the tokenizer leveraging multiple…

Philipp Schmid
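Partitioning the input across processing instances is possible with `s3_data_distribution_type="ShardedByS3Key"` on the `ProcessingInput`, which gives each node a disjoint subset of the S3 objects. A minimal sketch with placeholder paths, role, and versions; note this only distributes the reads, and merging per-node tokenizer statistics into a single tokenizer would still need an aggregation step inside the script:

```python
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = HuggingFaceProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_type="ml.c5.4xlarge",
    instance_count=4,                 # each instance gets ~1/4 of the objects
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)
processor.run(
    code="train_tokenizer.py",        # placeholder per-node tokenizer script
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/corpus/",             # placeholder corpus path
            destination="/opt/ml/processing/input",
            s3_data_distribution_type="ShardedByS3Key",  # partition objects by key
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://my-bucket/tokenizer/",     # placeholder output path
        )
    ],
)
```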