Questions tagged [dataparallel]
15 questions
5
votes
1 answer
Calling functions of a torch.nn.Module class wrapped with DataParallel
I have a class A that defines all my networks. I am wrapping this with torch.nn.DataParallel. When I call the forward function as a(), it works fine. However, I also want to call some other functions of A, while still retaining the DataParallel…

Nagabhushan S N
- 6,407
- 8
- 44
- 87
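
For the question above, the usual answer is that DataParallel only dispatches forward; other methods are reached through the wrapper's .module attribute (and then run on the unwrapped module, not in parallel). A minimal sketch with a hypothetical class A:

import torch
import torch.nn as nn

class A(nn.Module):                      # hypothetical stand-in for the asker's network container
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 10)
    def forward(self, x):
        return self.net(x)
    def other_function(self):            # a non-forward method that DataParallel does not proxy
        return sum(p.numel() for p in self.parameters())

a = nn.DataParallel(A())
out = a(torch.randn(4, 10))              # forward still goes through DataParallel
n = a.module.other_function()            # other methods are reached via the .module attribute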
1
vote
0 answers
PyTorch multi-node training returns TCPStore( RuntimeError: Address already in use
I am training a network on 2 machines; each machine has two GPUs. I have checked the port number used to connect the two machines to each other, but every time I get an error.
How do I find the port number? sudo lsof -i :22 | grep LISTEN
sshd 2101 …

Khawar Islam
- 2,556
- 2
- 34
- 56
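
A note on the question above: the lsof command inspects port 22 (SSH), not the rendezvous port; "Address already in use" typically means the MASTER_PORT chosen for torch.distributed is already taken on the master node. A rough sketch of probing a port and initializing the process group with the usual environment-variable rendezvous (the IP and port below are only examples):

import os
import socket
import torch.distributed as dist

def port_is_free(port, host="127.0.0.1"):
    # True when nothing is listening on the port yet.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0

port = 29500                                       # example rendezvous port; any free port works
print("rendezvous port free:", port_is_free(port))

# Every process on both machines must agree on these before init:
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")   # hypothetical IP of node 0
os.environ.setdefault("MASTER_PORT", str(port))
dist.init_process_group(backend="nccl")            # RANK / WORLD_SIZE come from the launcher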
0
votes
0 answers
How to use pwrite to write files in parallel on Linux in C++?
I'm trying to create several threads that write data chunks into one file in parallel.
Some part of my code is below:
void write_thread(float* data, size_t start, size_t end, size_t thread_idx) {
    auto function_start_time =…

Jerry
- 23
- 5
0
votes
0 answers
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 4055352) of
I wanted to use DistributedDataParallel to implement single-machine multi-GPU training for my model, but ran into some problems along the way.
The specific implementation code is:
def _train_one_epoch(self, epoch):
    score_AM =…

chihiro
- 5
- 4
0
votes
1 answer
DistributedDataParallel single-machine multi-GPU implementation with batches
I want to run my PyTorch model training code on multiple GPUs on a single server.
The specific scenario is as follows:
training runs for 2000 epochs, each epoch has 1000 training episodes, and there are three GPUs. The…

chihiro
- 5
- 4
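
For the single-machine multi-GPU scenario above, the standard pattern is one process per GPU with DistributedSampler sharding the episodes across ranks. A minimal sketch, assuming the script is launched with torchrun and using stand-in data:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

dist.init_process_group(backend="nccl")          # torchrun provides RANK / WORLD_SIZE
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

dataset = TensorDataset(torch.randn(1000, 8), torch.randn(1000, 1))  # stand-in for 1000 episodes
sampler = DistributedSampler(dataset)            # shards the episodes across the GPUs
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

model = DDP(torch.nn.Linear(8, 1).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(2000):
    sampler.set_epoch(epoch)                     # reshuffle the shards each epoch
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()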
0
votes
0 answers
How to use Fully Sharded Data Parallel (FSDP) via the Hugging Face Seq2SeqTrainer class?
I have 2 GTX 1080 Ti GPUs (11 GB memory each) and I want to fine-tune the openai/whisper-small model, one of the Hugging Face Transformers models. I want to use Fully Sharded Data Parallel (FSDP) via Seq2SeqTrainer, but I get an error.
torch…

vafa knm
- 1
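
For the FSDP question above, the trainer takes its FSDP settings through Seq2SeqTrainingArguments, and the script is then launched with one process per GPU (e.g. torchrun). A rough sketch with the dataset, processor, and data collator omitted and all values purely illustrative:

from transformers import (
    WhisperForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",   # hypothetical output path
    per_device_train_batch_size=8,
    fp16=True,                                # 1080 Ti has no bf16 support
    fsdp="full_shard auto_wrap",              # enable FSDP sharding
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=None,   # replace with the prepared dataset
)
# Launched as: torchrun --nproc_per_node=2 train.py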
0
votes
0 answers
Problem of GPU memory duplication across multiple GPUs when disabling data parallelization
I am working on a PyTorch project, and I want to disable data parallelization to ensure that each program runs on a single specified GPU, avoiding memory duplication. I have followed the standard steps of moving the model to the desired GPU device…

RiverFlows
- 31
- 4
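
For the question above, one common way to guarantee a run touches only one physical GPU is to restrict CUDA_VISIBLE_DEVICES before any CUDA work, so the chosen card shows up as cuda:0 inside the process and nothing is ever replicated. A minimal sketch (GPU index 1 is just an example):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # expose only physical GPU 1; set before any CUDA init

import torch
import torch.nn as nn

device = torch.device("cuda:0")            # the single visible GPU is always cuda:0
model = nn.Linear(128, 10).to(device)      # no DataParallel wrapper, so no replicas
x = torch.randn(4, 128, device=device)
print(model(x).shape, torch.cuda.device_count())   # device_count() should report 1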
0
votes
1 answer
Pytorch nn.DataParallel: RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
I am using the nn.DataParallel class to utilize multiple GPUs on a single machine. I have followed some Stack Overflow questions and answers but still get a simple error, and I have no idea why I am getting it.
Questions I followed:
RuntimeError:…

Khawar Islam
- 2,556
- 2
- 34
- 56
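
The mismatch in the question above usually means the input tensors were moved to the GPU while the model's weights stayed on the CPU (or the other way around). A rough sketch of the usual fix, moving both the DataParallel-wrapped model and the batch to the primary device:

import torch
import torch.nn as nn

device = torch.device("cuda:0")
model = nn.DataParallel(nn.Linear(32, 4)).to(device)   # weights now live on the GPUs; cuda:0 is the primary device

x = torch.randn(16, 32).to(device)          # inputs go to the primary device too
out = model(x)                              # DataParallel scatters x across the GPUs itself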
0
votes
0 answers
How to use torch.nn.DataParallel if I have more than one network working in tandem?
I have a model as such:
netF = timm.create_model(...) #feature extractor
netB = network.feat_bottlenect(...) #bottleneck layer
netC = network.feat_classifier(...) #classifier layer
output = netF(netB(netC(input)))
I want to apply…

Sadman Jahan
- 11
- 2
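
A common approach to the question above is to compose the three networks into one container module (nn.Sequential or a small custom nn.Module) and wrap that single object in DataParallel, so all three are replicated together on every forward pass. A minimal sketch with placeholder layers standing in for the timm/network models, composed feature extractor → bottleneck → classifier:

import torch
import torch.nn as nn

netF = nn.Linear(64, 32)    # placeholder for the feature extractor
netB = nn.Linear(32, 16)    # placeholder for the bottleneck layer
netC = nn.Linear(16, 10)    # placeholder for the classifier layer

full_model = nn.DataParallel(nn.Sequential(netF, netB, netC)).cuda()

x = torch.randn(8, 64).cuda()
output = full_model(x)

# The individual parts remain reachable for separate optimizers or checkpoints:
netC_params = full_model.module[2].parameters()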
0
votes
0 answers
Multi-node training: How to use multiple GPUs on multiple machines in PyTorch?
I am working with two machines, each with two GPUs, so I have 4 GPUs across the two machines overall. I am following the official PyTorch example to train on the ImageNet dataset. When I start the training…

Khawar Islam
- 2,556
- 2
- 34
- 56
0
votes
0 answers
PyTorch multiple GPUs: AttributeError: 'list' object has no attribute 'to'
I have implemented the DataParallel technique to utilize multiple GPUs on a single machine. I am getting an error in the fit function:
https://github.com/mindee/doctr/blob/main/references/recognition/train_pytorch.py
from fastprogress.fastprogress…

Khawar Islam
- 2,556
- 2
- 34
- 56
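
The error in the question above means .to(device) was called on a Python list rather than on a tensor; a batch that arrives as a list has to be moved element-wise (or stacked first). A small sketch with illustrative variable names:

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

images = [torch.randn(3, 32, 128) for _ in range(4)]   # a batch delivered as a list
# images.to(device)  ->  AttributeError: 'list' object has no attribute 'to'

images = [img.to(device) for img in images]            # move each tensor individually
# or, if all tensors share a shape, stack them into one tensor and move that:
# images = torch.stack(images).to(device)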
0
votes
0 answers
torch.multiprocessing.spawn.ProcessRaisedException: -- Process 0 terminated with the following error:
I am using multiple GPUs on the same system to train a network. I have followed all the steps mentioned in the PyTorch documentation. During validation, it gives an error regarding -- Process 0.
Step 1:
import torch.multiprocessing as mp
import torch.distributed…

Khawar Islam
- 2,556
- 2
- 34
- 56
0
votes
0 answers
terminate called after throwing an instance of 'std::runtime_error' what(): NCCL Error 1: unhandled cuda error
This error occurs when using DataParallel, but everything works when using only 1 GPU.
Why does this problem occur, and how can I solve it?
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 1: unhandled cuda…

CHF
- 9
- 1
0
votes
1 answer
Parameters can't be updated when using torch.nn.DataParallel to train on multiple GPUs
import torch
import torch.nn as nn
import os
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.h = -1
    def forward(self, x):
        self.h = x

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
if…

hescluke
- 3
- 3
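
Part of what the snippet above runs into is that nn.DataParallel executes forward on throwaway replicas, so attributes assigned inside forward (self.h here) land on the replicas and are discarded; only parameter gradients are reduced back to the original module. State that must survive the call is usually returned from forward instead, roughly as in this sketch:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 3)
    def forward(self, x):
        h = self.fc(x)       # keep per-batch state in a local variable ...
        return h             # ... and return it rather than assigning it to self

model = nn.DataParallel(Net()).cuda()
h = model(torch.randn(4, 3).cuda())   # the gathered result comes back to the caller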
0
votes
1 answer
Replacement of var.to(device) when using nn.DataParallel() in PyTorch
There is an existing question about this, but the answer is not relevant.
This code transfers the model to multiple GPUs, but how do I transfer the data to the GPUs?
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    #…

Adnan Ali
- 2,851
- 5
- 22
- 39
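
For the question above: with nn.DataParallel the data is moved exactly as in the single-GPU case, to the wrapper's primary device (normally cuda:0); the wrapper then scatters the batch across the remaining GPUs during forward. A short sketch of what that typically looks like:

import torch
import torch.nn as nn

model = nn.Linear(20, 5)
device = torch.device("cuda:0")

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
model.to(device)

inputs = torch.randn(8, 20).to(device)   # data still just goes to the primary device
outputs = model(inputs)                  # scattering to the other GPUs is automatic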