Questions tagged [distributed-training]

83 questions
0 votes, 1 answer

Best Practices for Distributed Training with PyTorch custom containers (BYOC) in SageMaker

What are the best practices for distributed training with PyTorch custom containers (BYOC) in Amazon SageMaker? I understand that the PyTorch framework supports either native distributed training or the Horovod library for PyTorch.
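A minimal sketch of the native-DDP route inside a BYOC training script, assuming the container's launcher (for example torchrun) exports RANK, LOCAL_RANK and WORLD_SIZE; the model is a placeholder:

```python
# Hedged sketch: native PyTorch DDP inside a custom (BYOC) container.
# Assumes the launcher (e.g. torchrun) sets RANK, LOCAL_RANK and WORLD_SIZE.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Join the process group using the environment-variable rendezvous.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```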
0 votes, 1 answer

Can Horovod with TensorFlow work on non-GPU instances in Amazon SageMaker?

I want to perform distributed training on Amazon SageMaker. The code is written with TensorFlow and is similar to the following code, where I think a CPU instance should be…
juvchan (6,113 reputation)
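Horovod itself does not require GPUs; a minimal sketch of a CPU-only Horovod/Keras job, assuming Horovod was built with a CPU-capable controller (MPI or Gloo) and using placeholder data:

```python
# Hedged sketch: Horovod with TensorFlow Keras on CPU-only instances.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # no GPU pinning needed on CPU instances

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

x = tf.random.normal((256, 10))   # placeholder data
y = tf.random.normal((256, 1))

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x, y, batch_size=32, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```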
0 votes, 1 answer

How to use multiple instances with the SageMaker XGBoost built-in algorithm?

If we use multiple instances for training, will the built-in algorithm automatically exploit them? For example, what if we used 2 instances for training with the built-in XGBoost container and the same customer churn example? Will one instance be…
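A hedged sketch of what requesting two instances for the built-in XGBoost container looks like; the role and S3 paths are placeholders, and ShardedByS3Key is one way to split the input objects across the instances:

```python
# Hedged sketch: two-instance training job with the built-in XGBoost container.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=2,                      # two training instances
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",   # placeholder bucket
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

# ShardedByS3Key splits the S3 objects between the two instances.
train_input = TrainingInput("s3://my-bucket/train", content_type="text/csv",
                            distribution="ShardedByS3Key")
xgb.fit({"train": train_input})
```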
0 votes, 2 answers

Why does PyTorch DDP init time out on SageMaker?

I'm using PyTorch DDP on the SageMaker PyTorch Training DLC 1.8.1. The code seems properly DDP-formatted. I'm using instance_count = 2 and launching with torch.distributed.launch, and I believe the ranks and world size are properly set, however…
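A small sketch of the rendezvous settings that most often cause this timeout with two instances; the variable names follow the env:// init method and the printed values are purely diagnostic:

```python
# Hedged sketch: checking the rendezvous settings behind a DDP init timeout.
import os
from datetime import timedelta
import torch.distributed as dist

print("MASTER_ADDR =", os.environ.get("MASTER_ADDR"))
print("MASTER_PORT =", os.environ.get("MASTER_PORT"))
print("RANK / WORLD_SIZE =", os.environ.get("RANK"), os.environ.get("WORLD_SIZE"))

# Every rank on both instances must reach this call with a consistent world
# size; otherwise the rendezvous hangs until the timeout fires.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=timedelta(minutes=10),
)
```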
0 votes, 1 answer

Why Does SageMaker Data Parallel Distributed Training Only Support 3 Instance Types?

I see here that the SageMaker distributed data parallel library only supports 3 instance types: ml.p3.16xlarge, ml.p3dn.24xlarge, ml.p4d.24xlarge. Why is this? I would have thought there might be use cases for parallel training on other GPUs, and even…
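For reference, a hedged sketch of how the library is switched on for a PyTorch estimator; it is only accepted together with one of the supported multi-GPU instance types, and the role, script and bucket below are placeholders:

```python
# Hedged sketch: enabling the SageMaker data parallel library on a supported
# instance type via the estimator's distribution argument.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                # placeholder script
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder role
    framework_version="1.12",
    py_version="py38",
    instance_count=2,
    instance_type="ml.p3.16xlarge",                        # one of the supported types
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit("s3://my-bucket/train")                      # placeholder channel
```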
0 votes, 0 answers

GPU 0 utilization higher than other GPUs on Amazon SageMaker SMDP (distributed training)

When using SageMaker Data Parallelism (SMDP), my team sees a higher utilization on GPU 0 compared to other GPUs. What can be the likely cause here? Does it have anything to do with the data loader workers that run on CPU? I would expect SMDP to…
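One frequent (though not SMDP-specific) cause is every process allocating on the default device; a hedged sketch of pinning each process to its local rank, shown with the generic torch.distributed API as a stand-in (SMDP exposes an equivalent local-rank query):

```python
# Hedged sketch: pin each training process to its own GPU so work does not
# pile up on cuda:0.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)          # subsequent .cuda() calls use this GPU

device = torch.device("cuda", local_rank)
model = torch.nn.Linear(10, 1).to(device)  # placeholder model on the right GPU
```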
0 votes, 0 answers

In TensorFlow 1.x distributed PS + worker training, do workers halt each other when doing sess.run()?

This is a general question regarding the PS + workers training paradigm in TensorFlow. Suppose this scenario: 1 PS + 2 workers are training asynchronously (suppose they have different training speeds), and suppose their graphs are all something like input…
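For context, a hedged sketch of TF 1.x between-graph replication: in asynchronous training each worker runs its own sess.run() loop against the shared parameter servers and does not wait for the other worker. Cluster addresses, loss and step count are placeholders, and each task would run this script with its own job_name/task_index:

```python
# Hedged sketch: asynchronous PS + worker training in TF 1.x.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],                                   # placeholders
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables land on the PS tasks, ops on this worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster,
                                              worker_device="/job:worker/task:0")):
    global_step = tf.train.get_or_create_global_step()
    loss = tf.reduce_mean(tf.square(tf.random.normal([32, 1])))       # stand-in loss
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

# Each worker's sess.run(train_op) proceeds independently of the other worker.
with tf.train.MonitoredTrainingSession(
        master=server.target, is_chief=True,
        hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```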
0 votes, 1 answer

Data parallelism on multiple GPUs

I am trying to train a model using data parallelism on multiple GPUs on a single machine. As I understand it, in data parallelism we divide the data into batches, and then the batches are processed in parallel. Afterward, the average gradient is calculated based…
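A minimal single-machine sketch using torch.nn.DataParallel, where each batch is scattered across the visible GPUs and the gradients are combined during backward before the optimizer steps; model and data are placeholders:

```python
# Hedged sketch: single-machine data parallelism with torch.nn.DataParallel.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicates the model, scatters each batch
model = model.cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x = torch.randn(64, 10).cuda()       # placeholder batch
y = torch.randn(64, 1).cuda()

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()                      # gradients from all replicas are combined here
optimizer.step()
```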
0 votes, 0 answers

How to train MNIST data with TensorFlow ParameterServerStrategy distributed training?

I'm trying to train the MNIST dataset using the ParameterServerStrategy. As a beginner, I find the documentation confusing, especially when it comes to the section "Clusters in the real world". These are the docs that I'm…
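A hedged sketch of the coordinator-side setup for ParameterServerStrategy in TF 2.x, assuming TF_CONFIG already describes the chief, worker and ps tasks (the part the "Clusters in the real world" section covers); the model is a minimal stand-in for MNIST:

```python
# Hedged sketch: coordinator-side ParameterServerStrategy training with Keras.
import tensorflow as tf

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

def dataset_fn(input_context):
    # Each worker builds its own copy of the dataset from this factory.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    ds = tf.data.Dataset.from_tensor_slices((x_train / 255.0, y_train))
    return ds.shuffle(1024).repeat().batch(64)

# model.fit with ParameterServerStrategy expects a per-worker dataset factory
# and an explicit steps_per_epoch.
model.fit(tf.keras.utils.experimental.DatasetCreator(dataset_fn),
          epochs=3, steps_per_epoch=100)
```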
0 votes, 0 answers

Keras model.fit throws Segmentation Fault with error: libprotobuf FATAL CHECK failed: (value.size()) <= (kint32max)

I am trying to train a simple TensorFlow model with around 9,000 parameters on an EMR cluster, but when I try to train the model it throws the following error. I tried increasing the memory and decreasing the batch size, but it didn't help. libprotobuf…
Mukul (310 reputation)
0 votes, 1 answer

Distributed sequential windowed data in PyTorch

At every epoch of my training, I need to split my dataset into n batches of t consecutive samples. For example, if my data is [1,2,3,4,5,6,7,8,9,10], n = 2 and t = 3, then valid batches would be [1-2-3, 4-5-6] and [7-8-9, 10-1-2], [2-3-4, 8-9-10] and…
Simon (5,070 reputation)
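One possible interpretation, sketched below: make each window of t consecutive samples a dataset item and let DistributedSampler deal the windows out to the ranks; the wrap-around offsets from the question's example are not reproduced here:

```python
# Hedged sketch: windows of t consecutive samples served through a
# DistributedSampler so each rank sees a disjoint share per epoch.
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

class WindowedDataset(Dataset):
    def __init__(self, data, t):
        self.data = data
        self.t = t

    def __len__(self):
        return len(self.data) // self.t

    def __getitem__(self, idx):
        start = idx * self.t
        return torch.tensor(self.data[start:start + self.t])

data = list(range(1, 11))          # [1, 2, ..., 10]
dataset = WindowedDataset(data, t=3)

# In a real DDP job, num_replicas/rank come from torch.distributed.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)       # reshuffles the window assignment each epoch
    for batch in loader:
        print(batch)
```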
0 votes, 0 answers

RuntimeError while running get_weights() in strategy.run in TensorFlow

I am new to tf.distribute and I do not know how to directly get the weights of a model in memory. I put my sample code below, and it gives a RuntimeError. import os import json # Dump the cluster information to `'TF_CONFIG'`. tf_config = { …
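A hedged sketch of the usual fix: read the weights in the cross-replica context, after strategy.run returns, rather than inside the replica function; MirroredStrategy and the tiny model stand in for the question's setup:

```python
# Hedged sketch: call model.get_weights() outside strategy.run, not inside
# the per-replica training step.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    optimizer = tf.keras.optimizers.SGD(0.01)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.normal((8, 4))   # placeholder data (same on every replica here)
y = tf.random.normal((8, 1))
strategy.run(train_step, args=(x, y))

# Cross-replica context: reading weights here avoids the RuntimeError.
weights = model.get_weights()
print([w.shape for w in weights])
```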
0 votes, 1 answer

PyTorch Lightning multi-node training error on GCP

We are currently working on a project that involves training with PyTorch Lightning. The code utilizes GPUs through DistributedDataParallel (DDP). Currently, it works fine while running on a single machine as a Vertex AI Training job and/or on…
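For reference, a hedged sketch of the Trainer arguments that usually drive multi-node DDP in Lightning, assuming the launcher on each node exports MASTER_ADDR, MASTER_PORT and NODE_RANK; model and dataloader are placeholders:

```python
# Hedged sketch: multi-node DDP configuration in PyTorch Lightning.
import lightning.pytorch as pl   # or: import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,          # GPUs per node
    num_nodes=2,        # total nodes in the training job
    strategy="ddp",
)
# trainer.fit(MyLightningModule(), train_dataloaders=my_loader)  # placeholders
```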
0 votes, 1 answer

SageMaker Distributed Data Parallelism not working as expected (smdistributed.dataparallel.torch.distributed)

All, I was trying the AWS SageMaker data parallelism approach for distributed training (using the two libraries): from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP import…
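A hedged sketch of the initialization order for the older smdistributed data parallel API that the question imports; the exact module paths follow the question itself and may differ between library versions:

```python
# Hedged sketch: init_process_group must run before wrapping the model with
# the SMDDP DistributedDataParallel class.
import torch
import smdistributed.dataparallel.torch.distributed as sm_dist
from smdistributed.dataparallel.torch.parallel.distributed import (
    DistributedDataParallel as DDP,
)

sm_dist.init_process_group()                      # start the SMDDP process group
local_rank = sm_dist.get_local_rank()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda(local_rank)   # placeholder model
model = DDP(model)                                # SMDDP handles the allreduce
```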
0 votes, 0 answers

PyTorch distributed: Running a shell command

I'm running a distributed PyTorch training. Everything works like a charm. I am fully utilizing all GPUs, all processes are in sync, everything is fine. At the end of each epoch, I want to run some elaborate evaluation in a new process (not to block…
Shai (111,146 reputation)
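A small sketch of one way to do this without blocking training: launch the evaluation as a detached subprocess from rank 0 only; the command line is a placeholder:

```python
# Hedged sketch: spawn the end-of-epoch evaluation from rank 0 so training
# on all ranks keeps going.
import subprocess
import torch.distributed as dist

def launch_eval(epoch):
    if dist.get_rank() == 0:
        # Popen returns immediately, so the training loop is not blocked.
        subprocess.Popen(["python", "evaluate.py", "--epoch", str(epoch)])  # placeholder command
    # Optional: keep all ranks aligned before the next epoch starts.
    dist.barrier()
```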