Questions tagged [distributed-training]
83 questions
1
vote
1 answer
Iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature
I am trying to adapt this COLA repo to my own audio dataset, which I have in a local folder. I mainly changed the file contrastive.py, adapting the method _get_ssl_task_data() to my new dataset.
However, I get an error triggered from model.fit (which calls my…

Othmane
- 1,094
- 2
- 17
- 33
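
The error in the question above usually means a symbolic tensor is being iterated with plain Python inside graph-traced code (model.fit traces the training step). Below is a minimal sketch of one common trigger, unrelated to the COLA repo itself and purely illustrative:

import tensorflow as tf

@tf.function
def bad_fn(x):
    # The list comprehension calls iter() on the symbolic tensor directly;
    # AutoGraph does not convert comprehensions, so tracing raises
    # "Iterating over a symbolic tf.Tensor is not allowed".
    return [v * 2.0 for v in x]

bad_fn(tf.constant([1.0, 2.0, 3.0]))
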
1
vote
0 answers
Spark memory usage keeps increasing while model training continues
I am training a U-Net model using TensorFlowOnSpark on a dataset of images that fits in memory, on a Spark cluster with 3 worker nodes (each running Ubuntu 20 with 11 GB of memory). Each node has 1 executor and 4 CPUs provided with 9 GB of…

Orwa kassab
- 111
- 1
- 9
1
vote
0 answers
Modify the ptrace without passing the flag
I'm running some distributed training on a platform using MPI. During training I see a flood of messages like:
Read -1, expected 5017600, errno = 1
Read -1, expected 5017600, errno = 1
Read -1, expected 5017600, errno = 1
Read -1, expected…

user3391299
- 73
- 4
1
vote
2 answers
Train on multiple devices
I know that TensorFlow offers a distributed training API that can train on multiple devices such as multiple GPUs, CPUs, TPUs, or multiple computers (workers).
Following this doc: …

TMN167
- 112
- 1
- 8
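
For the question above, here is a minimal sketch of the single-machine multi-GPU path of that API, using tf.distribute.MirroredStrategy (the model and shapes are placeholders):

import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU on one machine.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then splits each global batch across the replicas;
# multi-worker training swaps in tf.distribute.MultiWorkerMirroredStrategy.
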
1
vote
0 answers
How to run TensorFlow 2 in a distributed environment with Horovod?
I have successfully set up the distributed environment and ran the example with Horovod. I also know that if I want to run the benchmark on TensorFlow 1 in a distributed setup, e.g. on 4 nodes, then following the tutorial the submission should be:
$…

kingwales
- 23
- 6
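
As a point of reference for the question above, this is a sketch of how a TensorFlow 2 / Keras script is commonly adapted for Horovod; the model, learning rate and host list are placeholders, not the benchmark's real values:

# launch e.g.: horovodrun -np 8 -H node1:2,node2:2,node3:2,node4:2 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each local process to one GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(dataset, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
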
0
votes
0 answers
I have a question while performing distributed training using Horovod (Gloo and MPI)
I have a question while performing distributed training using Horovod. In the results from Gloo and MPI, I noticed that Gloo displays [0] [1] [2] [3] on the left during training, while MPI displays [1,0] [1,1] [1,2] [1,3]. What does this mean? I…

sykang
- 1
0
votes
0 answers
How to process a large dataset in PyTorch DDP mode?
I have a large dataset of about 900 GB, and my machine has 1 TB of memory. I want to train a model in distributed mode across 10 GPUs. I used to do this with TensorFlow and Horovod: I split the dataset into 10 parts, and each process only loads part…

haoran.li
- 11
- 2
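
A sketch of the pattern usually suggested for the question above: keep one copy of the data on disk and let DistributedSampler shard the indices per rank, instead of splitting the dataset into N physical parts. MyDataset below is a hypothetical lazily-loading dataset:

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset, DistributedSampler

class MyDataset(Dataset):
    # placeholder: reads one sample at a time from disk instead of loading 900 GB into RAM
    def __len__(self):
        return 1_000_000
    def __getitem__(self, idx):
        return torch.randn(128), torch.tensor(0)

# launched with torchrun, one process per GPU
dist.init_process_group("nccl")
dataset = MyDataset()
sampler = DistributedSampler(dataset)            # each rank sees 1/world_size of the indices
loader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=4)

for epoch in range(10):
    sampler.set_epoch(epoch)                     # reshuffle the shards every epoch
    for features, labels in loader:
        pass                                     # forward/backward with the DDP-wrapped model here
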
0
votes
0 answers
How to achieve distributed training with CPUs on multiple nodes?
I want to train a model with CPUs in distributed mode on 2 machines. The training scripts, run commands, and time consumed on each machine are as follows:
On machine1 (ip: 10.0.0.113):
training scripts on machine1:
import os
import time
import…

Gakki John
- 1
- 1
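
As a baseline for comparison with the question above, a minimal sketch of CPU-only multi-node DDP with the gloo backend; the model is a placeholder, and the master address simply reuses the machine1 IP from the question:

# run on each machine, e.g.:
# torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#          --master_addr=10.0.0.113 --master_port=29500 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")      # gloo supports CPU tensors
model = torch.nn.Linear(10, 1)
ddp_model = DDP(model)                       # no device_ids for CPU training

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(ddp_model(x), y)
loss.backward()                              # gradients are all-reduced over the network
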
0
votes
0 answers
PyTorch DDP (with Join Context Manager) consuming more power for uneven data distribution
I am using a 2-node distributed setup (each node having a single GPU) to train a neural network (NN). I use PyTorch DistributedDataParallel with the Join context manager to achieve this. I am measuring power consumption while varying the data…
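
For context on the question above, a sketch of the Join context manager with DDP: ranks that run out of batches early keep answering the collectives of the busier ranks, which is one reason uneven data can change per-GPU utilization and power draw. The model and batch counts below are placeholders:

import torch
import torch.distributed as dist
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                  # launched with torchrun, one process per GPU
model = DDP(torch.nn.Linear(10, 1).cuda())

num_batches = 8 if dist.get_rank() == 0 else 5   # deliberately uneven workload across ranks
with Join([model]):                              # lets the rank with fewer batches finish cleanly
    for _ in range(num_batches):
        x = torch.randn(4, 10).cuda()
        model(x).sum().backward()
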
0
votes
1 answer
Unable to train the conformer-rnnt model on tedlium data
I am trying to train the conformer-rnnt model on tedlium data and encounter the error below when the training command is executed.
usage: run_speech_recognition_rnnt.py [-h] (--manifest MANIFEST | --data_file DATA_FILE) --data_root DATA_ROOT…

moonface16
- 5
- 1
- 3
0
votes
0 answers
PyTorch DDP using torchrun
I'm trying to play with PyTorch DDP using torchrun. However, the script always crashes at the line with the first # FIXME. The file uses an IMDB dataset to do text classification.
Code:
# newer command: CUDA_LAUNCH_BLOCKING=1 torchrun --standalone…

Will ---
- 19
- 3
0
votes
0 answers
TensorFlow is not listing my dedicated GPU
I have a dedicated GPU installed in my device and want to use it for deep learning model training. I followed many tutorials on setting up tensorflow-gpu, but none of them worked for me.
Please guide me and provide a proper step-by-step process…

Abhinav Singh
- 9
- 2
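
A few quick checks that usually narrow down why TensorFlow does not list a GPU, relevant to the question above (this only diagnoses the problem; it is not a full install guide):

import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))   # empty list -> driver / CUDA / cuDNN mismatch
print(tf.test.is_built_with_cuda())             # False -> a CPU-only TensorFlow build is installed
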
0
votes
0 answers
Time and cost to train a DistilGPT-2 model on BookCorpus using AWS EC2
I am trying to calculate the time it would take to train a DistilGPT-2 model on the BookCorpus dataset using multiple EC2 instances, for the purpose of language modeling.
What is the method for calculating the training time of language models?

Troi
- 43
- 2
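
One common back-of-the-envelope method for the question above is the compute ≈ 6 · N · D rule of thumb (N parameters, D training tokens). The numbers below are placeholders, not measured values for DistilGPT-2, BookCorpus, or any particular EC2 instance type:

# rough training-time estimate under the 6*N*D approximation
params = 82e6          # assumed parameter count
tokens = 1.0e9         # assumed training tokens (all epochs combined)
flops_needed = 6 * params * tokens

gpu_flops = 15e12      # assumed sustained throughput per GPU (FLOP/s), utilization included
num_gpus = 8           # e.g. one 8-GPU instance
hours = flops_needed / (gpu_flops * num_gpus) / 3600
print(f"rough wall-clock estimate: {hours:.1f} h on {num_gpus} GPUs")

Multiplying the estimated wall-clock hours by the instance's hourly price then gives a rough cost figure.
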
0
votes
0 answers
Turn off Distributed Training
I was working on a project that involves captioning and wanted to use a model I found on GitHub to run inference. The problem is that in the main file they used distributed training to train on multiple GPUs, and I have only…

Sagnnik Biswas
- 15
- 2
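
A common workaround for the situation in the question above is to make the DDP wrapper conditional, so the same script runs unchanged on a single GPU. The environment variables assume a torchrun-style launcher, and the model below is a placeholder for the captioning model:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

world_size = int(os.environ.get("WORLD_SIZE", "1"))
model = torch.nn.Linear(10, 1).cuda()

if world_size > 1:
    dist.init_process_group("nccl")
    model = DDP(model, device_ids=[int(os.environ["LOCAL_RANK"])])
# with a single process, the plain model is used and no process group is created
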
0
votes
0 answers
Stacked vs. eponymous torchrun CLI options
Docs here: https://pytorch.org/docs/stable/elastic/run.html#single-node-multi-worker
In the PyTorch docs for torchrun, two options are listed for single-node multi-worker training: “Single-node multi-worker” and “Stacked single-node multi-worker”.
For…

Rob
- 1
- 1