Questions tagged [distributed-training]
83 questions
1
vote
1 answer
Iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature
I am trying to adapt this COLA repo to my own audio dataset, which I have in a local folder. I mainly changed the file contrastive.py, adapting the method _get_ssl_task_data() to my new dataset.
However, I get an error triggered from model.fit (which calls my…

Othmane
- 1,094
- 2
- 17
- 33
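
The error in the question above usually means a symbolic tensor is being iterated with plain Python inside graph-traced code (model.fit traces the training step). Below is a minimal sketch of one common trigger, unrelated to the COLA repo itself and purely illustrative:

import tensorflow as tf

@tf.function
def bad_fn(x):
    # The list comprehension calls iter() on the symbolic tensor directly;
    # AutoGraph does not convert comprehensions, so tracing raises
    # "Iterating over a symbolic tf.Tensor is not allowed".
    return [v * 2.0 for v in x]

bad_fn(tf.constant([1.0, 2.0, 3.0]))
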
1
vote
0 answers
Spark memory usage keeps increasing while model training continues
I am training a U-Net model using TensorFlowOnSpark on a dataset of images that fits in memory, on a Spark cluster with 3 worker nodes (each running Ubuntu 20 with 11 GB of memory). Each node has 1 executor and 4 CPUs provided with 9 GB of…

Orwa kassab
- 111
- 1
- 9
1
vote
0 answers
Modify the ptrace without passing the flag
I'm running some distributed training on a platform using MPI. During training I see a flood of messages like:
Read -1, expected 5017600, errno = 1
Read -1, expected 5017600, errno = 1
Read -1, expected 5017600, errno = 1
Read -1, expected…

user3391299
- 73
- 4
1
vote
2 answers
Train on multiple devices
I know that TensorFlow offers a distributed training API that can train on multiple devices such as multiple GPUs, CPUs, TPUs, or multiple computers (workers).
Following this doc: …

TMN167
- 112
- 1
- 8
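
For the question above, here is a minimal sketch of the single-machine multi-GPU path of that API, using tf.distribute.MirroredStrategy (the model and shapes are placeholders):

import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU on one machine.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then splits each global batch across the replicas;
# multi-worker training swaps in tf.distribute.MultiWorkerMirroredStrategy.
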
1
vote
0 answers
How to run TensorFlow 2 in a distributed environment with Horovod?
I have successfully set up the distributed environment and ran the example with Horovod. I also know that if I want to run the benchmark on TensorFlow 1 in a distributed setup, e.g. on 4 nodes, then following the tutorial the submission should be:
$…

kingwales
- 23
- 6
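
As a point of reference for the question above, this is a sketch of how a TensorFlow 2 / Keras script is commonly adapted for Horovod; the model, learning rate and host list are placeholders, not the benchmark's real values:

# launch e.g.: horovodrun -np 8 -H node1:2,node2:2,node3:2,node4:2 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each local process to one GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(dataset, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
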
0
votes
0 answers
I have a question while performing distributed training using Horovod (Gloo and MPI)
I have a question while performing distributed training using Horovod. In the results from Gloo and MPI, I noticed that Gloo displays [0] [1] [2] [3] on the left during training, while MPI displays [1,0] [1,1] [1,2] [1,3]. What does this mean? I…

sykang
- 1
0
votes
0 answers
How to process a large dataset in PyTorch DDP mode?
I have a large dataset of about 900 GB, and my machine has 1 TB of memory. I want to train a model in distributed mode across 10 GPUs. I used to do this with TensorFlow and Horovod: I split the dataset into 10 parts, and each process only loads part…

haoran.li
- 11
- 2
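
A sketch of the pattern usually suggested for the question above: keep one copy of the data on disk and let DistributedSampler shard the indices per rank, instead of splitting the dataset into N physical parts. MyDataset below is a hypothetical lazily-loading dataset:

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset, DistributedSampler

class MyDataset(Dataset):
    # placeholder: reads one sample at a time from disk instead of loading 900 GB into RAM
    def __len__(self):
        return 1_000_000
    def __getitem__(self, idx):
        return torch.randn(128), torch.tensor(0)

# launched with torchrun, one process per GPU
dist.init_process_group("nccl")
dataset = MyDataset()
sampler = DistributedSampler(dataset)            # each rank sees 1/world_size of the indices
loader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=4)

for epoch in range(10):
    sampler.set_epoch(epoch)                     # reshuffle the shards every epoch
    for features, labels in loader:
        pass                                     # forward/backward with the DDP-wrapped model here
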
0
votes
0 answers
How to achieve distributed training with CPUs on multiple nodes?
I want to train a model with CPUs in distributed mode on 2 machines. The training scripts, run commands, and time consumed on each machine are as follows:
On machine1 (ip: 10.0.0.113):
training scripts on machine1:
import os
import time
import…

Gakki John
- 1
- 1
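
As a baseline for comparison with the question above, a minimal sketch of CPU-only multi-node DDP with the gloo backend; the model is a placeholder, and the master address simply reuses the machine1 IP from the question:

# run on each machine, e.g.:
# torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#          --master_addr=10.0.0.113 --master_port=29500 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")      # gloo supports CPU tensors
model = torch.nn.Linear(10, 1)
ddp_model = DDP(model)                       # no device_ids for CPU training

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(ddp_model(x), y)
loss.backward()                              # gradients are all-reduced over the network
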
0
votes
0 answers
PyTorch DDP (with Join Context Manager) consuming more power for uneven data distribution
I am using a 2-node distributed setup (each node having a single GPU) to train a neural network (NN). I use PyTorch DistributedDataParallel with the Join context manager to achieve this. I am measuring power consumption while varying the data…
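
For context on the question above, a sketch of the Join context manager with DDP: ranks that run out of batches early keep answering the collectives of the busier ranks, which is one reason uneven data can change per-GPU utilization and power draw. The model and batch counts below are placeholders:

import torch
import torch.distributed as dist
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                  # launched with torchrun, one process per GPU
model = DDP(torch.nn.Linear(10, 1).cuda())

num_batches = 8 if dist.get_rank() == 0 else 5   # deliberately uneven workload across ranks
with Join([model]):                              # lets the rank with fewer batches finish cleanly
    for _ in range(num_batches):
        x = torch.randn(4, 10).cuda()
        model(x).sum().backward()
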
0
votes
1 answer
Unable to train the conformer-rnnt model on tedlium data
I am trying to train the conformer-rnnt model on tedlium data and encounter the error below when the training command is executed.
usage: run_speech_recognition_rnnt.py [-h] (--manifest MANIFEST | --data_file DATA_FILE) --data_root DATA_ROOT…

moonface16
- 5
- 1
- 3
0
votes
0 answers
PyTorch DDP using torchrun
I'm trying to play with PyTorch DDP using torchrun. However, the script always crashes at the line with the first # FIXME. The file uses an IMDB dataset to do text classification.
Code:
# newer command: CUDA_LAUNCH_BLOCKING=1 torchrun --standalone…

Will ---
- 19
- 3
0
votes
0 answers
TensorFlow is not listing my dedicated GPU
I have a dedicated GPU installed in my device and want to use it for deep learning model training. I followed many tutorials on setting up tensorflow-gpu, but none of them worked for me.
Please guide me and provide a proper step-by-step process…

Abhinav Singh
- 9
- 2
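
A few quick checks that usually narrow down why TensorFlow does not list a GPU, relevant to the question above (this only diagnoses the problem; it is not a full install guide):

import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))   # empty list -> driver / CUDA / cuDNN mismatch
print(tf.test.is_built_with_cuda())             # False -> a CPU-only TensorFlow build is installed
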
0
votes
0 answers
Time and cost to train a DistilGPT-2 model on BookCorpus using AWS EC2
I am trying to calculate the time it would take to train a DistilGPT-2 model on the BookCorpus dataset using multiple EC2 instances, for the purpose of language modeling.
What is the method for calculating the training time of language models?

Troi
- 43
- 2
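
One common back-of-the-envelope method for the question above is the compute ≈ 6 · N · D rule of thumb (N parameters, D training tokens). The numbers below are placeholders, not measured values for DistilGPT-2, BookCorpus, or any particular EC2 instance type:

# rough training-time estimate under the 6*N*D approximation
params = 82e6          # assumed parameter count
tokens = 1.0e9         # assumed training tokens (all epochs combined)
flops_needed = 6 * params * tokens

gpu_flops = 15e12      # assumed sustained throughput per GPU (FLOP/s), utilization included
num_gpus = 8           # e.g. one 8-GPU instance
hours = flops_needed / (gpu_flops * num_gpus) / 3600
print(f"rough wall-clock estimate: {hours:.1f} h on {num_gpus} GPUs")

Multiplying the estimated wall-clock hours by the instance's hourly price then gives a rough cost figure.
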
0
votes
0 answers
Turn off Distributed Training
I was working on a project that involves captioning and wanted to use a model I found on GitHub to run inference. The problem is that in the main file they used distributed training to train on multiple GPUs, and I have only…

Sagnnik Biswas
- 15
- 2
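
A common workaround for the situation in the question above is to make the DDP wrapper conditional, so the same script runs unchanged on a single GPU. The environment variables assume a torchrun-style launcher, and the model below is a placeholder for the captioning model:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

world_size = int(os.environ.get("WORLD_SIZE", "1"))
model = torch.nn.Linear(10, 1).cuda()

if world_size > 1:
    dist.init_process_group("nccl")
    model = DDP(model, device_ids=[int(os.environ["LOCAL_RANK"])])
# with a single process, the plain model is used and no process group is created
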
0
votes
0 answers
Stacked vs. eponymous torchrun CLI options
Docs here: https://pytorch.org/docs/stable/elastic/run.html#single-node-multi-worker
In the PyTorch docs for torchrun, two options are listed for single-node multi-worker training: “Single-node multi-worker” and “Stacked single-node multi-worker”.
For…

Rob
- 1
- 1