Questions tagged [distributed-training]

83 questions
0 votes, 1 answer

Best Practices for Distributed Training with PyTorch custom containers (BYOC) in SageMaker

What are the best practices for distributed training with PyTorch custom containers (BYOC) in Amazon SageMaker? I understand that the PyTorch framework supports either native distributed training or the Horovod library for PyTorch.
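A minimal sketch of the native-DDP route inside a BYOC training script, assuming the container's launcher (for example torchrun) exports RANK, LOCAL_RANK and WORLD_SIZE; the model is a placeholder:

```python
# Hedged sketch: native PyTorch DDP inside a custom (BYOC) container.
# Assumes the launcher (e.g. torchrun) sets RANK, LOCAL_RANK and WORLD_SIZE.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Join the process group using the environment-variable rendezvous.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```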
0 votes, 1 answer

Can Horovod with TensorFlow work on non-GPU instances in Amazon SageMaker?

I want to perform distributed training on Amazon SageMaker. The code is written with TensorFlow and is similar to the following code, where I think a CPU instance should be…
juvchan (6,113 reputation)
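Horovod itself does not require GPUs; a minimal sketch of a CPU-only Horovod/Keras job, assuming Horovod was built with a CPU-capable controller (MPI or Gloo) and using placeholder data:

```python
# Hedged sketch: Horovod with TensorFlow Keras on CPU-only instances.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # no GPU pinning needed on CPU instances

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

x = tf.random.normal((256, 10))   # placeholder data
y = tf.random.normal((256, 1))

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x, y, batch_size=32, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```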
0 votes, 1 answer

How to use multiple instances with the SageMaker XGBoost built-in algorithm?

If we use multiple instances for training, will the built-in algorithm automatically exploit them? For example, what if we used 2 instances for training with the built-in XGBoost container and the same customer churn example? Will one instance be…
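A hedged sketch of what requesting two instances for the built-in XGBoost container looks like; the role and S3 paths are placeholders, and ShardedByS3Key is one way to split the input objects across the instances:

```python
# Hedged sketch: two-instance training job with the built-in XGBoost container.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=2,                      # two training instances
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",   # placeholder bucket
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

# ShardedByS3Key splits the S3 objects between the two instances.
train_input = TrainingInput("s3://my-bucket/train", content_type="text/csv",
                            distribution="ShardedByS3Key")
xgb.fit({"train": train_input})
```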
0 votes, 2 answers

Why does PyTorch DDP init time out on SageMaker?

I'm using PyTorch DDP on the SageMaker PyTorch Training DLC 1.8.1. The code seems properly DDP-formatted. I'm using instance_count = 2 and launching with torch.distributed.launch, and I believe the ranks and world size are properly set, however…
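A small sketch of the rendezvous settings that most often cause this timeout with two instances; the variable names follow the env:// init method and the printed values are purely diagnostic:

```python
# Hedged sketch: checking the rendezvous settings behind a DDP init timeout.
import os
from datetime import timedelta
import torch.distributed as dist

print("MASTER_ADDR =", os.environ.get("MASTER_ADDR"))
print("MASTER_PORT =", os.environ.get("MASTER_PORT"))
print("RANK / WORLD_SIZE =", os.environ.get("RANK"), os.environ.get("WORLD_SIZE"))

# Every rank on both instances must reach this call with a consistent world
# size; otherwise the rendezvous hangs until the timeout fires.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=timedelta(minutes=10),
)
```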
0 votes, 1 answer

Why Does SageMaker Data Parallel Distributed Training Only Support 3 Instance Types?

I see here that the SageMaker distributed data parallel library only supports 3 instance types: ml.p3.16xlarge, ml.p3dn.24xlarge, ml.p4d.24xlarge. Why is this? I would have thought there might be use cases for parallel training on other GPUs, and even…
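For reference, a hedged sketch of how the library is switched on for a PyTorch estimator; it is only accepted together with one of the supported multi-GPU instance types, and the role, script and bucket below are placeholders:

```python
# Hedged sketch: enabling the SageMaker data parallel library on a supported
# instance type via the estimator's distribution argument.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                # placeholder script
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder role
    framework_version="1.12",
    py_version="py38",
    instance_count=2,
    instance_type="ml.p3.16xlarge",                        # one of the supported types
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit("s3://my-bucket/train")                      # placeholder channel
```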
0 votes, 0 answers

GPU 0 utilization higher than other GPUs on Amazon SageMaker SMDP (distributed training)

When using SageMaker Data Parallelism (SMDP), my team sees a higher utilization on GPU 0 compared to other GPUs. What can be the likely cause here? Does it have anything to do with the data loader workers that run on CPU? I would expect SMDP to…
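One frequent (though not SMDP-specific) cause is every process allocating on the default device; a hedged sketch of pinning each process to its local rank, shown with the generic torch.distributed API as a stand-in (SMDP exposes an equivalent local-rank query):

```python
# Hedged sketch: pin each training process to its own GPU so work does not
# pile up on cuda:0.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)          # subsequent .cuda() calls use this GPU

device = torch.device("cuda", local_rank)
model = torch.nn.Linear(10, 1).to(device)  # placeholder model on the right GPU
```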
0 votes, 0 answers

In TensorFlow 1.x distributed PS + worker training, do workers halt each other when doing sess.run()?

This is a general question regarding the PS + workers training paradigm in TensorFlow. Suppose this scenario: 1 PS + 2 workers are training asynchronously (suppose they have different training speeds), and suppose their graphs are all something like input…
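For context, a hedged sketch of TF 1.x between-graph replication: in asynchronous training each worker runs its own sess.run() loop against the shared parameter servers and does not wait for the other worker. Cluster addresses, loss and step count are placeholders, and each task would run this script with its own job_name/task_index:

```python
# Hedged sketch: asynchronous PS + worker training in TF 1.x.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],                                   # placeholders
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables land on the PS tasks, ops on this worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster,
                                              worker_device="/job:worker/task:0")):
    global_step = tf.train.get_or_create_global_step()
    loss = tf.reduce_mean(tf.square(tf.random.normal([32, 1])))       # stand-in loss
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

# Each worker's sess.run(train_op) proceeds independently of the other worker.
with tf.train.MonitoredTrainingSession(
        master=server.target, is_chief=True,
        hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```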
0 votes, 1 answer

Data parallelism on multiple GPUs

I am trying to train a model using data parallelism on multiple GPUs on a single machine. As I understand it, in data parallelism we divide the data into batches, and then the batches are processed in parallel. Afterward, the average gradient is calculated based…
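A minimal single-machine sketch using torch.nn.DataParallel, where each batch is scattered across the visible GPUs and the gradients are combined during backward before the optimizer steps; model and data are placeholders:

```python
# Hedged sketch: single-machine data parallelism with torch.nn.DataParallel.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicates the model, scatters each batch
model = model.cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x = torch.randn(64, 10).cuda()       # placeholder batch
y = torch.randn(64, 1).cuda()

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()                      # gradients from all replicas are combined here
optimizer.step()
```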
0 votes, 0 answers

How to train MNIST data with TensorFlow ParameterServerStrategy distributed training?

I'm trying to train the MNIST dataset using the ParameterServerStrategy. As a beginner, I find the documentation confusing, especially when it comes to the section "Clusters in the real world". These are the docs that I'm…
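A hedged sketch of the coordinator-side setup for ParameterServerStrategy in TF 2.x, assuming TF_CONFIG already describes the chief, worker and ps tasks (the part the "Clusters in the real world" section covers); the model is a minimal stand-in for MNIST:

```python
# Hedged sketch: coordinator-side ParameterServerStrategy training with Keras.
import tensorflow as tf

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

def dataset_fn(input_context):
    # Each worker builds its own copy of the dataset from this factory.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    ds = tf.data.Dataset.from_tensor_slices((x_train / 255.0, y_train))
    return ds.shuffle(1024).repeat().batch(64)

# model.fit with ParameterServerStrategy expects a per-worker dataset factory
# and an explicit steps_per_epoch.
model.fit(tf.keras.utils.experimental.DatasetCreator(dataset_fn),
          epochs=3, steps_per_epoch=100)
```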
0 votes, 0 answers

Keras model.fit throws Segmentation Fault with error: libprotobuf FATAL CHECK failed: (value.size()) <= (kint32max)

I am trying to train a simple TensorFlow model with around 9,000 parameters on an EMR cluster, but when I try to train the model it throws the following error. I tried increasing the memory and decreasing the batch size, but it didn't help. libprotobuf…
Mukul (310 reputation)
0 votes, 1 answer

Distributed sequential windowed data in PyTorch

At every epoch of my training, I need to split my dataset into n batches of t consecutive samples. For example, if my data is [1,2,3,4,5,6,7,8,9,10], n = 2 and t = 3, then valid batches would be [1-2-3, 4-5-6] and [7-8-9, 10-1-2], [2-3-4, 8-9-10] and…
Simon (5,070 reputation)
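One possible interpretation, sketched below: make each window of t consecutive samples a dataset item and let DistributedSampler deal the windows out to the ranks; the wrap-around offsets from the question's example are not reproduced here:

```python
# Hedged sketch: windows of t consecutive samples served through a
# DistributedSampler so each rank sees a disjoint share per epoch.
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

class WindowedDataset(Dataset):
    def __init__(self, data, t):
        self.data = data
        self.t = t

    def __len__(self):
        return len(self.data) // self.t

    def __getitem__(self, idx):
        start = idx * self.t
        return torch.tensor(self.data[start:start + self.t])

data = list(range(1, 11))          # [1, 2, ..., 10]
dataset = WindowedDataset(data, t=3)

# In a real DDP job, num_replicas/rank come from torch.distributed.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)       # reshuffles the window assignment each epoch
    for batch in loader:
        print(batch)
```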
0 votes, 0 answers

RuntimeError while running get_weights() in strategy.run in TensorFlow

I am new to tf.distribute and I do not know how to directly get the weights of a model in memory. I put my sample code below, and it gives a RuntimeError. import os import json # Dump the cluster information to `'TF_CONFIG'`. tf_config = { …
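A hedged sketch of the usual fix: read the weights in the cross-replica context, after strategy.run returns, rather than inside the replica function; MirroredStrategy and the tiny model stand in for the question's setup:

```python
# Hedged sketch: call model.get_weights() outside strategy.run, not inside
# the per-replica training step.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    optimizer = tf.keras.optimizers.SGD(0.01)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.normal((8, 4))   # placeholder data (same on every replica here)
y = tf.random.normal((8, 1))
strategy.run(train_step, args=(x, y))

# Cross-replica context: reading weights here avoids the RuntimeError.
weights = model.get_weights()
print([w.shape for w in weights])
```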
0 votes, 1 answer

PyTorch Lightning multi-node training error on GCP

We are currently working on a project that involves training with PyTorch Lightning. The code utilizes GPUs through DistributedDataParallel (DDP). Currently, it works fine while running on a single machine as a Vertex AI Training job and/or on…
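For reference, a hedged sketch of the Trainer arguments that usually drive multi-node DDP in Lightning, assuming the launcher on each node exports MASTER_ADDR, MASTER_PORT and NODE_RANK; model and dataloader are placeholders:

```python
# Hedged sketch: multi-node DDP configuration in PyTorch Lightning.
import lightning.pytorch as pl   # or: import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,          # GPUs per node
    num_nodes=2,        # total nodes in the training job
    strategy="ddp",
)
# trainer.fit(MyLightningModule(), train_dataloaders=my_loader)  # placeholders
```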
0 votes, 1 answer

SageMaker Distributed Data Parallelism not working as expected (smdistributed.dataparallel.torch.distributed)

All, I was trying the AWS SageMaker data parallelism approach for distributed training (using the two libraries): from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP import…
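A hedged sketch of the initialization order for the older smdistributed data parallel API that the question imports; the exact module paths follow the question itself and may differ between library versions:

```python
# Hedged sketch: init_process_group must run before wrapping the model with
# the SMDDP DistributedDataParallel class.
import torch
import smdistributed.dataparallel.torch.distributed as sm_dist
from smdistributed.dataparallel.torch.parallel.distributed import (
    DistributedDataParallel as DDP,
)

sm_dist.init_process_group()                      # start the SMDDP process group
local_rank = sm_dist.get_local_rank()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda(local_rank)   # placeholder model
model = DDP(model)                                # SMDDP handles the allreduce
```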
0 votes, 0 answers

PyTorch distributed: Running a shell command

I'm running a distributed PyTorch training. Everything works like a charm. I am fully utilizing all GPUs, all processes are in sync, everything is fine. At the end of each epoch, I want to run some elaborate evaluation in a new process (not to block…
Shai (111,146 reputation)
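A small sketch of one way to do this without blocking training: launch the evaluation as a detached subprocess from rank 0 only; the command line is a placeholder:

```python
# Hedged sketch: spawn the end-of-epoch evaluation from rank 0 so training
# on all ranks keeps going.
import subprocess
import torch.distributed as dist

def launch_eval(epoch):
    if dist.get_rank() == 0:
        # Popen returns immediately, so the training loop is not blocked.
        subprocess.Popen(["python", "evaluate.py", "--epoch", str(epoch)])  # placeholder command
    # Optional: keep all ranks aligned before the next epoch starts.
    dist.barrier()
```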