Questions tagged [distributed-training]
83 questions
0
votes
0 answers
ncclInternalError: Internal check failed. Proxy Call to rank 0 failed (Connect)
After setting up a Ray cluster with 2 single-GPU nodes, and also a direct PyTorch distributed run … with the same nodes, I got my distributed processes registered. Starting…

NavinKumarmMNK
- 883
- 3
- 7
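A minimal connectivity check for the question above (a sketch, not the asker's code; the file name and launch command are illustrative) that helps separate this class of NCCL error from the training logic, assuming a torchrun launch that sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT:

# check_dist.py - run on both nodes with:
#   torchrun --nnodes=2 --nproc_per_node=1 --rdzv_backend=c10d \
#            --rdzv_endpoint=<node0-ip>:29500 check_dist.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # "Proxy Call to rank 0 failed (Connect)"-style errors usually surface here or at
    # the first collective, when the nodes cannot reach each other's NCCL ports.
    dist.init_process_group(backend="nccl", init_method="env://")
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)  # a tiny all_reduce confirms every rank can talk to the others
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} got {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this minimal script fails the same way, the problem is likely networking (hostnames, firewall, NCCL_SOCKET_IFNAME) rather than Ray or the model code.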
0
votes
0 answers
Training a model on multiple GPUs is very slow
I want to train a model on multiple GPUs on one node with the following code
strategy = tf.distribute.MirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))
# Open a strategy scope.
with strategy.scope():
#…

Akbari
- 31
- 2
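For reference, a self-contained MirroredStrategy sketch along the lines of the snippet above (the MNIST model and batch sizes are placeholders, not the asker's setup). A per-replica batch that is too small is a common reason multi-GPU training ends up slower than a single GPU, so the global batch size is scaled with the replica count:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))

# Scale the global batch size with the number of replicas.
per_replica_batch = 64
global_batch = per_replica_batch * strategy.num_replicas_in_sync

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(10_000)
           .batch(global_batch)
           .prefetch(tf.data.AUTOTUNE))

# Variables created inside the scope are mirrored across the GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.fit(dataset, epochs=2)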
0
votes
0 answers
PyTorch Lightning not using all resources
I am running the lab 1 example as-is. Everything goes fine and training succeeds. But when I check the training logs, it is all happening on [1,mpirank:0,algo-1]. I am passing instance_count as two and can see there are two hosts [algo-1 and…

souraj
- 13
- 2
0
votes
0 answers
Some questions about distributed training in PyTorch
I want to use mmcv for distributed training. However, I ran into some problems.
# Copyright (c) OpenMMLab. All rights reserved.
import os.path as osp
import platform
import shutil
import time
import warnings
from typing import Callable, Dict, List,…

LOTEAT
- 181
- 1
- 4
0
votes
0 answers
This torch DDP training script executes only the first epoch and stops thereafter
I am currently working on porting an existing (and working) training script that I wrote to a multi-GPU machine, and I encounter the following problem: the code detects all 8 GPUs (I am using torchrun to execute the file) and does the first epoch as…

Nisse97
- 1
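A minimal DDP skeleton for comparison (a sketch, not the asker's script) with the epoch-loop structure torchrun expects: one process per GPU, a DistributedSampler, and sampler.set_epoch() at the top of every epoch. The toy dataset and model are placeholders; launch with torchrun --nproc_per_node=8 train.py:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Toy data; a real script would build its own dataset here.
    ds = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(ds)
    loader = DataLoader(ds, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(16, 2).to(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(5):
        sampler.set_epoch(epoch)          # reshuffle consistently across ranks
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        if dist.get_rank() == 0:
            print(f"finished epoch {epoch}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()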
0
votes
1 answer
Is there a way to use distributed training with Dask using my GPU?
As of now, the LightGBM model supports GPU training and distributed training (using Dask).
If it is possible, how can I use distributed training with Dask using my GPU, or is there any other way to do so?
Actually my task is to use the power of the GPU and…

Sumitram Kumar
- 23
- 7
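For what it's worth, a hedged sketch of combining the two: LightGBM's Dask estimators (e.g. lightgbm.DaskLGBMRegressor) distribute training across Dask workers, and passing device_type="gpu" asks each worker to use LightGBM's GPU tree learner, provided the installed LightGBM build has GPU support. The cluster size and data below are placeholders:

import dask.array as da
from dask.distributed import Client, LocalCluster
import lightgbm as lgb

if __name__ == "__main__":
    # One worker per machine/GPU; a real setup would point at a remote scheduler.
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    client = Client(cluster)

    # Placeholder data as Dask arrays, partitioned across workers.
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.random(100_000, chunks=(10_000,))

    model = lgb.DaskLGBMRegressor(
        client=client,
        n_estimators=100,
        device_type="gpu",   # requires a GPU-enabled LightGBM build
    )
    model.fit(X, y)
    print(model.predict(X).compute()[:5])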
0
votes
1 answer
How to merge models from distributed training
Here is my code for distributed training via spark-tensorflow-distributor, which uses TensorFlow's MultiWorkerMirroredStrategy to train using multiple…

olaf
- 239
- 1
- 8
0
votes
1 answer
How does TensorFlow MultiWorkerMirroredStrategy work during autoscaling and failures if you have to configure cluster_resolver?
It seems like I have to configure the cluster_resolver before running training to enable distributed training on multiple workers.
But how does that work with autoscaling and node…

olaf
- 239
- 1
- 8
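A sketch of how the cluster gets wired in, assuming the common TF_CONFIG route: tf.distribute.cluster_resolver.TFConfigClusterResolver reads the cluster spec from the TF_CONFIG environment variable when the strategy is created, and (as far as I understand) that spec is fixed for the lifetime of the job, so autoscaling or replacing a node generally means restarting the workers with an updated TF_CONFIG rather than resizing a running job. The addresses and task index below are placeholders; each worker runs the same script with its own index:

import json
import os
import tensorflow as tf

# Placeholder cluster spec: two workers on the same host for illustration.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
    "task": {"type": "worker", "index": 0},  # the second worker would set index 1
})

resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)
print("replicas in sync:", strategy.num_replicas_in_sync)
# Model building and model.fit() would go inside strategy.scope(); collectives
# only proceed once every worker listed in the cluster spec is up.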
0
votes
1 answer
RuntimeError: Method requires being in cross-replica context, use get_replica_context().merge_call()
I am trying to modify a working single-GPU CycleGAN to use tf.distribute.MirroredStrategy.
I have tried several things, such as custom training loops, the question by jongsung park, adjustments following the TensorFlow tutorial, and several placements of strategy.scope().…

Florian
- 11
- 1
- 3
0
votes
0 answers
SageMaker Distributed Data Parallel Library (SMDDP) runtime error in BYOC TensorFlow 2.x environment
I am setting up a custom container (BYOC) for distributed training in TensorFlow 2.x using the SageMaker Distributed Data Parallel Library (SMDDP), but I get the following runtime error when importing smdistributed.dataparallel.tensorflow:
RuntimeError:…

juvchan
- 6,113
- 2
- 22
- 35
0
votes
1 answer
Accelerate BERT training with HuggingFace Model Parallelism
I am currently using SageMaker to train BERT and am trying to improve the BERT training time. I use PyTorch and Hugging Face on an AWS g4dn.12xlarge instance.
However, when I run parallel training, it is far from achieving a linear improvement. I'm…

juvchan
- 6,113
- 2
- 22
- 35
0
votes
1 answer
Distributed Unsupervised Learning in SageMaker
I am running local unsupervised learning (predominantly clustering) on a large single node with a GPU.
Does SageMaker support distributed unsupervised learning using clustering?
If yes, please provide the relevant example (preferably non-TensorFlow).

juvchan
- 6,113
- 2
- 22
- 35
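One option that seems to fit the question above (a sketch under the assumption that the built-in k-means algorithm is acceptable): SageMaker's built-in KMeans is an Amazon algorithm, not TensorFlow-based, and accepts instance_count > 1, which is the usual way to spread a clustering job over several nodes. The role, bucket and data below are placeholders:

import numpy as np
from sagemaker import KMeans

kmeans = KMeans(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=2,                 # distribute training across two instances
    instance_type="ml.c5.xlarge",
    k=10,
    output_path="s3://my-bucket/kmeans-output/",          # placeholder bucket
)

# Toy data; record_set() uploads it in the recordIO-protobuf format the
# built-in algorithm expects.
train_data = np.random.rand(10_000, 32).astype("float32")
kmeans.fit(kmeans.record_set(train_data))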
0
votes
3 answers
Does SageMaker built-in LightGBM algorithm support distributed training?
Does Amazon SageMaker built-in LightGBM algorithm support distributed training?
I use Databricks for distributed training of LightGBM today. If SageMaker built-in LightGBM supports distributed training, I would consider migrating to SageMaker. It…

juvchan
- 6,113
- 2
- 22
- 35
0
votes
1 answer
How to run SageMaker Distributed training from SageMaker Studio?
The sample notebooks for SageMaker Distributed training, like the one here: https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-scriptmode-s3.ipynb rely on the docker build .…

juvchan
- 6,113
- 2
- 22
- 35
0
votes
1 answer
Using GPU Spot Instance(s) for SageMaker Distributed Training?
I have a requirement to use N single-GPU Spot instances instead of one N-GPU instance for distributed training.
Does SageMaker Distributed Training support the use of GPU Spot instances? If yes, how do I enable it?

juvchan
- 6,113
- 2
- 22
- 35
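For context, a hedged sketch of how managed spot training is normally requested on a SageMaker estimator (use_spot_instances, max_wait and checkpoint_s3_uri), here combined with the native PyTorch DDP distribution; it shows where the knobs live rather than confirming that every SageMaker distributed-training library supports spot. The role, framework versions, script and bucket names are placeholders:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                # placeholder script
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder role
    framework_version="1.13.1",
    py_version="py39",
    instance_type="ml.g4dn.xlarge",
    instance_count=4,                           # N x single-GPU instances
    distribution={"pytorchddp": {"enabled": True}},
    use_spot_instances=True,                    # request spot capacity
    max_run=3600,
    max_wait=7200,                              # must be >= max_run for spot jobs
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume after interruption
)
estimator.fit({"training": "s3://my-bucket/data/"})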