Questions tagged [distributed-training]
83 questions
0
votes
1 answer
Best Practices for Distributed Training with PyTorch custom containers (BYOC) in SageMaker
What are the best practices for distributed training with PyTorch custom containers (BYOC) in Amazon SageMaker? I understand that the PyTorch framework supports native distributed training or using the Horovod library for PyTorch.

juvchan
- 6,113
- 2
- 22
- 35
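For the question above, a minimal sketch of the native PyTorch path (torch.distributed with the NCCL backend) as it could appear in a BYOC training entry point. The environment-variable names are the standard torch.distributed launcher ones and are an assumption about how the container is started, not SageMaker-specific behaviour:

# Minimal sketch: native PyTorch DDP init inside a custom (BYOC) training script.
# Assumes the launcher (e.g. torchrun / torch.distributed.launch) sets RANK,
# WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT -- an assumption, not
# something SageMaker guarantees for every container setup.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")           # one process per GPU
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 2).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])       # wrap for gradient averaging
    # ... training loop goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()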
0
votes
1 answer
Can Horovod with TensorFlow work on non-GPU instances in Amazon SageMaker?
I want to perform distributed training on Amazon SageMaker. The code is written with TensorFlow and is similar to the following code, where I think a CPU instance should be…

juvchan
- 6,113
- 2
- 22
- 35
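Relating to the question above, a minimal sketch of Horovod initialization with TensorFlow/Keras. On CPU-only instances there are no visible GPUs, so the GPU-pinning step is simply skipped and Horovod falls back to its CPU collectives; this is a hedged illustration, not SageMaker-specific code:

# Minimal Horovod + TensorFlow sketch. On GPU instances each process is pinned
# to one GPU; on CPU-only instances the GPU list is empty and the pinning step
# is a no-op, so the same script can run on either.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

gpus = tf.config.list_physical_devices("GPU")
if gpus:                                   # only pin when GPUs are present
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])              # placeholder model
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="mse")
# model.fit(..., callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])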
0
votes
1 answer
How to use multiple instances with the SageMaker XGBoost built-in algorithm?
If we use multiple instances for training, will the built-in algorithm automatically exploit them? For example, what if we used 2 instances for training using the built-in XGBoost container and we used the same customer churn example? Will one instance be…

Kyle Gallatin
- 146
- 6
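A hedged sketch of how instance_count is passed when launching the built-in XGBoost algorithm from the SageMaker Python SDK (v2 parameter names); the role ARN, bucket paths, and hyperparameters below are placeholders:

# Sketch: requesting two instances for the built-in XGBoost container via the
# SageMaker Python SDK (v2). Role, bucket, and prefix names are placeholders.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"      # placeholder role

xgb_image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=2,                      # two training instances
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",   # placeholder
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)
estimator.fit({"train": TrainingInput("s3://my-bucket/train", content_type="csv")})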
0
votes
2 answers
Why does PyTorch DDP init time out on SageMaker?
I'm using PyTorch DDP on the SageMaker PyTorch Training DLC 1.8.1. The code seems properly DDP-formatted. I'm using instance_count = 2 and launching with torch.distributed.launch, and I believe the ranks and world size are properly set, however…

Philipp Schmid
- 126
- 7
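For the timeout question above, a hedged sketch of the init call with an explicit timeout plus the rank/world-size logging one would typically add when debugging a hanging init; the environment-variable names follow the standard torch.distributed.launch conventions:

# Sketch: initializing the default process group with an explicit timeout and
# printing the rank/world-size each process sees, to debug a hanging init.
# Assumes torch.distributed.launch (or torchrun) set the usual env variables.
import datetime
import os
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=10),   # fail fast instead of the 30-minute default
)
print(
    f"rank={dist.get_rank()} world_size={dist.get_world_size()} "
    f"master={os.environ.get('MASTER_ADDR')}:{os.environ.get('MASTER_PORT')}"
)
dist.barrier()   # every rank must reach this point, or the timeout above fires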
0
votes
1 answer
Why does SageMaker Data Parallel distributed training only support 3 instance types?
I see here that the SageMaker distributed data parallel library only supports 3 instance types: ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge.
Why is this? I would have thought there might be use cases for parallel training on other GPUs, and even…

Philipp Schmid
- 126
- 7
0
votes
0 answers
GPU 0 utilization higher than other GPUs on Amazon SageMaker SMDP (distributed training)
When using SageMaker Data Parallelism (SMDP), my team sees a higher utilization on GPU 0 compared to other GPUs.
What can be the likely cause here?
Does it have anything to do with the data loader workers that run on CPU? I would expect SMDP to…

Philipp Schmid
- 126
- 7
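Not an answer to the open question above, but one sanity check commonly made in this situation is that every process pins its own CUDA device before building the model or data loaders, so stray CUDA contexts do not all land on GPU 0. A hedged sketch; the LOCAL_RANK lookup is an assumption about how the worker processes are launched:

# Sketch of a per-process device pin: each data-parallel worker selects its own
# GPU up front, so CUDA contexts (and the extra memory/utilization they bring)
# are not all created on GPU 0. LOCAL_RANK here is an assumed launcher variable.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)                 # pin this process to its GPU
device = torch.device("cuda", local_rank)

x = torch.randn(8, 3, 224, 224, device=device)    # tensors now land on the right GPU
print(f"local_rank={local_rank} is using CUDA device {torch.cuda.current_device()}")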
0
votes
0 answers
In TensorFlow 1.x distributed PS + worker training, do workers halt each other when doing sess.run()?
This is a general question regarding the PS + worker training paradigm in TensorFlow. Suppose this scenario:
1 PS + 2 workers are training asynchronously (suppose they have different training speeds), and suppose their graphs are all something like input…

Interfish
- 11
- 3
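For reference on the question above, a hedged sketch of the classic TF1-style between-graph setup, where each worker runs its own session and asynchronous updates mean workers do not wait for each other on sess.run(); hosts, ports, and task indices are placeholders:

# Sketch of TF1-style asynchronous PS/worker training (between-graph replication),
# written against the tf.compat.v1 API. Hosts/ports and task indices are placeholders.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
job_name, task_index = "worker", 0          # placeholder: set per process
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                           # the parameter server just serves variables
else:
    # Variables live on the PS; each worker builds its own replica of the graph
    # and calls sess.run() independently (asynchronous updates, no barrier).
    with tf.device(tf.train.replica_device_setter(
            cluster=cluster, worker_device=f"/job:worker/task:{task_index}")):
        w = tf.get_variable("w", shape=[1], initializer=tf.zeros_initializer())
        loss = tf.reduce_sum(tf.square(w - 1.0))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0)) as sess:
        sess.run(train_op)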
0
votes
1 answer
Data parallelism on multiple GPUs
I am trying to train a model using data parallelism on multiple GPUs on a single machine. As I understand it, in data parallelism we divide the data into batches, and the batches are then processed in parallel. Afterward, the average gradient is calculated based…

Ahmad
- 645
- 2
- 6
- 21
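As a reference for the description above, a minimal single-machine data-parallel sketch using torch.nn.DataParallel, which splits each input batch across the visible GPUs, runs the replicas in parallel, and combines the gradients back onto the default device; the model and data are placeholders:

# Minimal sketch of single-machine data parallelism with torch.nn.DataParallel.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)          # wraps the model for multi-GPU use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(128, 32, device=device)            # placeholder batch
y = torch.randint(0, 10, (128,), device=device)

opt.zero_grad()
loss = loss_fn(model(x), y)                         # batch is scattered across GPUs
loss.backward()                                     # gradients are reduced onto the wrapped model
opt.step()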
0
votes
0 answers
How to train MNIST data with TensorFlow ParameterServerStrategy distributed training?
I'm trying to train on the MNIST dataset using ParameterServerStrategy. As a beginner, I find the documentation confusing, especially when it comes to the section "Clusters in the real world". This is the documentation that I'm…

cosmicRover
- 23
- 6
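For the question above, a hedged sketch of the coordinator-side program for tf.distribute ParameterServerStrategy with Keras model.fit. It assumes TF_CONFIG already describes a cluster whose worker and ps tasks are running tf.distribute.Server; the model, batch size, and step counts are placeholders, not a full multi-machine recipe:

# Sketch of the coordinator program for ParameterServerStrategy + model.fit.
# Assumes TF_CONFIG lists "chief", "worker" and "ps" tasks and that the
# worker/ps processes are already up and serving.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)

def dataset_fn(_):
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x.reshape(-1, 784).astype("float32") / 255.0
    return tf.data.Dataset.from_tensor_slices((x, y)).shuffle(60_000).batch(64).repeat()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit dispatches training steps to the workers through the strategy.
model.fit(tf.keras.utils.experimental.DatasetCreator(dataset_fn),
          epochs=2, steps_per_epoch=200)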
0
votes
0 answers
Keras model.fit throws Segmentation Fault with error: libprotobuf FATAL CHECK failed: (value.size()) <= (kint32max)
I am trying to train a simple TensorFlow model with around 9000 parameters on an EMR cluster. But when I try to train the model, it throws the following error. I tried increasing the memory and decreasing the batch size, but it didn't help.
libprotobuf…

Mukul
- 310
- 1
- 6
- 13
0
votes
1 answer
Distributed sequential windowed data in pytorch
At every epoch of my training, I need to split my dataset into n batches of t consecutive samples. For example, if my data is [1,2,3,4,5,6,7,8,9,10], n = 2 and t = 3 then valid batches would be
[1-2-3, 4-5-6] and [7-8-9, 10-1-2]
[2-3-4, 8-9-10] and…

Simon
- 5,070
- 5
- 33
- 59
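For the windowing question above, a hedged sketch of one way to express "batches of t consecutive samples" as a custom torch Sampler; the wrap-around and the reshuffling of window starts each epoch are an interpretation of the example, not the asker's exact requirement:

# Sketch: a batch sampler that yields windows of t consecutive indices, with the
# window start positions reshuffled every epoch. Wrap-around (e.g. the [10, 1, 2]
# window in the question) is handled with a modulo.
import torch
from torch.utils.data import Sampler

class ConsecutiveWindowSampler(Sampler):
    def __init__(self, data_len, window, generator=None):
        self.data_len, self.window = data_len, window
        self.generator = generator

    def __iter__(self):
        starts = torch.randperm(self.data_len, generator=self.generator)
        for s in starts.tolist():
            yield [(s + offset) % self.data_len for offset in range(self.window)]

    def __len__(self):
        return self.data_len

# Usage: pass it as batch_sampler so each "batch" is one consecutive window, e.g.
# loader = DataLoader(dataset, batch_sampler=ConsecutiveWindowSampler(len(dataset), t))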
0
votes
0 answers
RuntimeError while running get_weights() in strategy.run in TensorFlow
I am new to tf.distribute and I do not know how to directly get the weights of a model in memory. My sample code is below, and it gives a RuntimeError.
import os
import json
# Dump the cluster information to `'TF_CONFIG'`.
tf_config = {
…

Jacob975
- 1
- 1
0
votes
1 answer
PyTorch Lightning multi-node training error on GCP
We are currently working on a project that involves training on PyTorch Lightning. The code utilizes GPUs through DistributedDataParallel (DDP). Currently, it works fine while running on a single machine as a Vertex AI Training job and/or on…

Yasser H
- 1
- 1
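For context on the question above, a hedged sketch of how multi-node DDP is typically requested through the Lightning Trainer; the node count, device count, and model are placeholders, the argument names follow recent Lightning releases, and the environment handling (MASTER_ADDR, MASTER_PORT, NODE_RANK per node) is an assumption about the cluster launcher:

# Sketch: requesting multi-node DDP from PyTorch Lightning. Assumes the cluster
# (e.g. a Vertex AI custom job) starts one copy of this script per node and
# provides MASTER_ADDR / MASTER_PORT / NODE_RANK in the environment.
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,          # GPUs per node (placeholder)
    num_nodes=2,        # total nodes (placeholder)
    strategy="ddp",
    max_epochs=1,
)
trainer.fit(ToyModule(), DataLoader(dataset, batch_size=64))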
0
votes
1 answer
SageMaker Distributed Data Parallelism not working as expected (smdistributed.dataparallel.torch.distributed)
All,
I was trying the AWS SageMaker data parallelism approach for distributed training (using the two libraries):
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP
import…
0
votes
0 answers
PyTorch distributed: Running shell command
I'm running a distributed PyTorch training job. Everything works like a charm. I am fully utilizing all GPUs, all processes are in sync, everything is fine.
At the end of each epoch, I want to run some elaborate evaluation in a new process (not to block…

Shai
- 111,146
- 38
- 238
- 371
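For the last question above, a hedged sketch of one common pattern: only rank 0 launches the external process (non-blocking via subprocess.Popen) while the other ranks continue, with an optional barrier if the ranks need to stay in step. This is an illustration, not the asker's code, and the script path is a placeholder:

# Sketch: spawning a shell command from a distributed PyTorch job without
# blocking training. Only rank 0 launches it; Popen returns immediately.
import subprocess
import torch.distributed as dist

def launch_eval(epoch):
    if dist.get_rank() == 0:
        # Non-blocking: the child runs while training continues on all ranks.
        subprocess.Popen(["python", "evaluate.py", "--epoch", str(epoch)])  # placeholder script
    # Optional: keep ranks in step before starting the next epoch.
    dist.barrier()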