Questions tagged [distributed-training]

83 questions
0 votes, 0 answers

ncclInternalError: Internal check failed. Proxy Call to rank 0 failed (Connect)

ncclInternalError: Internal check failed. Proxy Call to rank 0 failed (Connect) After setting up a Ray cluster with 2 single-GPU nodes & also a direct PyTorch distributed run … with the same nodes, I got my distributed process registered. starting…
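
For context, a minimal two-node NCCL setup that this kind of error usually comes from looks roughly like the sketch below; the launch command, addresses, interface name, and NCCL_DEBUG/NCCL_SOCKET_IFNAME settings are assumptions for illustration, not a diagnosis of this specific failure.

```python
# Hypothetical two-node launch, one GPU per node (all values are placeholders):
#   NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=eth0 \
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
#            --master_addr=10.0.0.1 --master_port=29500 train.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Simple all-reduce to verify that cross-node NCCL connectivity works.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
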
0 votes, 0 answers

Training a model on multiple GPUs is very slow

I want to train a model on multiple GPUs on one node, but with the following code strategy = tf.distribute.MirroredStrategy() print("Number of devices: {}".format(strategy.num_replicas_in_sync)) # Open a strategy scope. with strategy.scope(): #…
Akbari • 31 • 2
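
For reference, a minimal single-node MirroredStrategy setup is sketched below; one common reason multi-GPU training ends up slower is keeping the single-GPU batch size, so the sketch scales the global batch size by strategy.num_replicas_in_sync. The toy MNIST model and batch size are placeholders.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))

# Scale the global batch size with the number of replicas, otherwise each
# GPU only processes a fraction of the per-GPU batch and utilization drops.
per_replica_batch = 64  # assumption: tune for your model
global_batch = per_replica_batch * strategy.num_replicas_in_sync

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

with strategy.scope():
    # Build and compile the model inside the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.fit(x_train, y_train, batch_size=global_batch, epochs=2)
```
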
0 votes, 0 answers

PyTorch Lightning not using all resources

I am running the lab 1 example as-is. Everything goes fine and training succeeds. But when I check the training logs, it is all happening on [1,mpirank:0,algo-1]. I am passing instance_count as two and can see there are two hosts [algo-1 and…
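
A rough sketch of the Lightning knobs that control how many hosts and devices are used (devices, num_nodes, strategy); the tiny module, data, and the values chosen are assumptions and do not reproduce the SageMaker/MPI wiring of the lab.

```python
import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class TinyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=64)

# If num_nodes does not match the actual number of hosts (e.g. instance_count=2),
# Lightning only schedules work on the ranks it knows about.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,      # GPUs per node (assumption)
    num_nodes=2,    # should match the number of hosts / instance_count
    strategy="ddp",
    max_epochs=1,
)
trainer.fit(TinyModule(), loader)
```
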
0 votes, 0 answers

Some questions about distributed training in PyTorch

I want to use mmcv for distributed training. However, I had some problems. # Copyright (c) OpenMMLab. All rights reserved. import os.path as osp import platform import shutil import time import warnings from typing import Callable, Dict, List,…
LOTEAT • 181 • 1 • 4
0 votes, 0 answers

This torch DDP training script executes only the first epoch and stops thereafter

I am currently working on porting an existing (and working) training script that I wrote to a multi-GPU machine. I encounter the following problem: the code does detect all 8 GPUs (I am using torchrun to execute the file) and does the first epoch as…
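
For reference, a minimal per-epoch DDP loop as typically written for torchrun is sketched below; the model and data are placeholders. One per-epoch detail worth checking is that every rank runs the same number of steps per epoch, since a mismatched collective call makes training hang or appear to stop.

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dataset = TensorDataset(torch.randn(4096, 16), torch.randn(4096, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = DDP(nn.Linear(16, 1).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(5):
        # Reshuffle per epoch; all ranks must iterate the same number of steps,
        # otherwise a collective op blocks and the job looks stalled.
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            nn.functional.mse_loss(model(x), y).backward()
            optimizer.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch} done")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
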
0 votes, 1 answer

Is there a way to use distributed training with DASK using my GPU?

As of now, the LightGBM model supports GPU training and distributed training (using DASK). If it is possible, how can I use distributed training with DASK using my GPU, or is there any other way to do so? Actually my task is to use the power of the GPU and…
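
A hedged sketch of combining the two: LightGBM's Dask estimators handle the distribution, and device_type='gpu' asks each worker to train on its GPU. This assumes a GPU-enabled LightGBM build and the optional dask_cuda package; the cluster setup and toy data are placeholders.

```python
import dask.array as da
import lightgbm as lgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster  # optional helper: one worker per GPU

# Spin up one Dask worker per local GPU (assumes dask_cuda is installed).
cluster = LocalCUDACluster()
client = Client(cluster)

# Toy data; in practice this would be a distributed Dask array or dataframe.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.random((100_000,), chunks=(10_000,))

# device_type='gpu' requires LightGBM compiled with GPU support.
model = lgb.DaskLGBMRegressor(
    client=client,
    n_estimators=100,
    device_type="gpu",
)
model.fit(X, y)
print(model.predict(X[:10]).compute())
```
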
0 votes, 1 answer

How to merge a model from distributed training

Here is my code for distributed training via spark-tensorflow-distributor, which uses TensorFlow's MultiWorkerMirroredStrategy to train using multiple…
olaf • 239 • 1 • 8
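
Worth noting for this question: with MultiWorkerMirroredStrategy the synchronized updates keep every replica's weights identical, so there is usually nothing to merge; the common pattern is to treat the chief worker's saved model as the result. A sketch of that pattern is below, with the TF_CONFIG parsing and paths as assumptions.

```python
import json
import os
import tensorflow as tf

def _task():
    # TF_CONFIG is set for each worker by the launcher (e.g. spark-tensorflow-distributor).
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    return tf_config.get("task", {"type": "worker", "index": 0})

def is_chief():
    task = _task()
    # Worker 0 acts as chief when no dedicated "chief" task is configured.
    return task.get("type") == "chief" or (
        task.get("type") == "worker" and task.get("index", 0) == 0
    )

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer="adam", loss="mse")

# ... model.fit(...) runs on every worker; synchronized steps keep the weights identical ...

# Every worker calls save (saving can involve collective ops), but only the
# chief writes to the real destination; the others write to throwaway paths.
save_dir = "/tmp/final_model" if is_chief() else f"/tmp/worker_{_task().get('index', 0)}"
model.save(save_dir)  # placeholder paths
```
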
0 votes, 1 answer

How does tensorflow MultiWorkerMirroredStrategy work during autoscaling and failure if you have to configure cluster_resolver?

It seems like I have to configure cluster_resolver before running training to enable distributed training on multiple workers. But how does that work with autoscaling and node…
olaf • 239 • 1 • 8
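
For context, the default resolver simply reads the TF_CONFIG environment variable, so a sketch of what each worker sees looks like the following (addresses are placeholders). As far as I know, MultiWorkerMirroredStrategy does not rescale a running job on its own; after autoscaling changes the worker list, TF_CONFIG has to be regenerated and the workers restarted.

```python
import json
import os
import tensorflow as tf

# Each worker gets the full worker list plus its own (type, index).
# This process pretends to be worker 0 of a two-worker cluster.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},  # placeholder addresses
    "task": {"type": "worker", "index": 0},
})

resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)
print("replicas in sync:", strategy.num_replicas_in_sync)
```
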
0 votes, 1 answer

RuntimeError: Method requires being in cross-replica context, use get_replica_context().merge_call()

I am trying to modify a running CycleGAN from single-GPU to tf.distribute.MirroredStrategy. I have tried several things like custom training loops, the question by jongsung park, adjustments following the TensorFlow tutorial, and several placements of strategy.scope().…
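
A generic sketch (not the CycleGAN code) of the pattern that usually avoids this class of error: per-replica work stays inside strategy.run, and cross-replica aggregation happens outside of it via strategy.reduce. The toy model, optimizer, and data are assumptions.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    optimizer = tf.keras.optimizers.Adam(1e-4)

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((256, 4)), tf.random.normal((256, 1)))
).batch(32)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def train_step(dist_inputs):
    def replica_step(x, y):
        # Runs once per replica; only per-replica ops belong here.
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    per_replica_loss = strategy.run(replica_step, args=dist_inputs)
    # Cross-replica aggregation happens outside strategy.run.
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)

for batch in dist_dataset:
    loss = train_step(batch)
```
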
0 votes, 0 answers

SageMaker Distributed Data Parallel Library (SMDDP) runtime error in BYOC TensorFlow 2.x environment

I am setting up a custom container (BYOC) for distributed training in TensorFlow 2.x using the SageMaker Distributed Data Parallel Library (SMDDP), but got the following runtime error when importing smdistributed.dataparallel.tensorflow: RuntimeError:…
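
For reference, the TF2 side of SMDDP follows a Horovod-style pattern, sketched below under the assumption that the library is installed in the BYOC image, the job runs on a supported multi-GPU instance type, and the estimator launches it with the data-parallel distribution enabled. This is an illustration of the API shape, not a fix for the specific RuntimeError.

```python
import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp

sdp.init()

# Pin each process to its own GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.SGD(0.01 * sdp.size())  # scale LR with world size

@tf.function
def train_step(x, y, first_batch):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
    # Wrap the tape so gradients are all-reduced across workers.
    tape = sdp.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Make sure all workers start from identical weights
        # (optimizer state can be broadcast the same way).
        sdp.broadcast_variables(model.variables, root_rank=0)
    return loss
```
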
0 votes, 1 answer

Accelerate BERT training with HuggingFace Model Parallelism

I am currently using SageMaker to train BERT and trying to improve the BERT training time. I use PyTorch and Hugging Face on the AWS g4dn.12xlarge instance type. However, when I run parallel training, it is far from achieving linear improvement. I'm…
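
If the intent is SageMaker's model-parallel library rather than plain data parallelism, the relevant knob is the estimator's distribution argument. A rough sketch with the HuggingFace estimator follows; the role, script, versions, and parallelism parameters are assumptions, and model parallelism mainly helps when the model does not fit on a single GPU.

```python
from sagemaker.huggingface import HuggingFace

# Placeholder role/script; the smdistributed parameters below are illustrative.
estimator = HuggingFace(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.g4dn.12xlarge",
    instance_count=1,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {"partitions": 2, "microbatches": 4, "ddp": True},
            }
        },
        "mpi": {"enabled": True, "processes_per_host": 4},
    },
    hyperparameters={"epochs": 1, "model_name_or_path": "bert-base-uncased"},
)
estimator.fit("s3://my-bucket/train")  # placeholder S3 input
```
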
0 votes, 1 answer

Distributed Unsupervised Learning in SageMaker

I am running local unsupervised learning (predominantly clustering) on a large, single node with GPU. Does SageMaker support distributed unsupervised learning using clustering? If yes, please provide the relevant example (preferably non-TensorFlow).
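
One concrete, non-TensorFlow option is the built-in k-means estimator, which distributes training simply by setting instance_count greater than one. A hedged sketch (role, bucket, instance type, and data are placeholders):

```python
import numpy as np
import sagemaker
from sagemaker import KMeans

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder
session = sagemaker.Session()

kmeans = KMeans(
    role=role,
    instance_count=2,            # >1 distributes training across nodes
    instance_type="ml.c5.xlarge",
    k=10,
    output_path=f"s3://{session.default_bucket()}/kmeans-output",
)

# record_set uploads the numpy data to S3 in the protobuf recordIO format
# the built-in algorithm expects.
train_data = np.random.rand(10_000, 32).astype("float32")
kmeans.fit(kmeans.record_set(train_data))
```
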
0 votes, 3 answers

Does SageMaker built-in LightGBM algorithm support distributed training?

Does Amazon SageMaker built-in LightGBM algorithm support distributed training? I use Databricks for distributed training of LightGBM today. If SageMaker built-in LightGBM supports distributed training, I would consider migrating to SageMaker. It…
0 votes, 1 answer

How to run SageMaker Distributed training from SageMaker Studio?

The sample notebooks for SageMaker Distributed training, like here: https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-scriptmode-s3.ipynb rely on the docker build .…
0 votes, 1 answer

Using GPU Spot Instance(s) for SageMaker Distributed Training?

I have a requirement to use N single-GPU Spot instances instead of one N-GPU instance for distributed training. Does SageMaker Distributed Training support the use of GPU Spot instance(s)? If yes, how do I enable it?
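
Managed Spot is configured on the estimator itself rather than on the distribution setting; a hedged sketch of the flags involved follows. The estimator class, script, DDP distribution key, and checkpoint path are assumptions, and the training script needs to resume from the checkpoint location to survive Spot interruptions.

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                               # placeholder script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    framework_version="1.13",
    py_version="py39",
    instance_type="ml.g4dn.xlarge",
    instance_count=4,                          # N single-GPU instances
    distribution={"pytorchddp": {"enabled": True}},       # assumption: DDP launcher
    use_spot_instances=True,                   # request managed Spot capacity
    max_run=3600,                              # seconds of actual training allowed
    max_wait=7200,                             # must be >= max_run; includes Spot waiting time
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",      # persisted across interruptions
)
estimator.fit("s3://my-bucket/train")
```
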