Questions tagged [dataparallel]

15 questions
5
votes
1 answer

Calling functions of a torch.nn.Module class wrapped with DataParallel

I have a class A that defines all my networks. I am wrapping this with torch.nn.DataParallel. When I call the forward function as a(), it works fine. However, I also want to call some other functions of A, while still retaining the DataParallel…
Nagabhushan S N
  • 6,407
  • 8
  • 44
  • 87
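A common approach for this is to reach the wrapped instance through DataParallel's .module attribute, so forward() stays parallelized while other methods are called directly. A minimal sketch; the class A and extra_method below are stand-ins for the asker's code, not their actual implementation:

```python
import torch
import torch.nn as nn

class A(nn.Module):                      # stand-in for the asker's network class
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)

    def extra_method(self):              # hypothetical non-forward method
        return "not parallelized"

a = nn.DataParallel(A())                 # add .cuda() on a multi-GPU machine
out = a(torch.randn(8, 4))               # forward() goes through DataParallel
print(a.module.extra_method())           # other methods via the underlying module
```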
1
vote
0 answers

PyTorch multi-node training returns TCPStore( RuntimeError: Address already in use

I am training a network on 2 machines; each machine has two GPUs. I have checked the port number used to connect the two machines, but every time I get an error. How do I find the port number? sudo lsof -i :22 | grep LISTEN sshd 2101 …
Khawar Islam
  • 2,556
  • 2
  • 34
  • 56
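The "Address already in use" message usually means the chosen MASTER_PORT is already bound on the rank-0 machine. A hedged sketch of picking a free port and passing the same address/port to every node; the IP, port, and world size below are placeholders:

```python
import os
import socket
import torch.distributed as dist

def find_free_port():
    # Ask the OS for a TCP port that is not already in use on this machine.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

print(find_free_port())  # run once on the rank-0 machine, then reuse the value below

# Every process on every node must use the same rank-0 address/port pair.
os.environ["MASTER_ADDR"] = "192.168.1.10"   # placeholder rank-0 IP
os.environ["MASTER_PORT"] = "29500"          # must not already be bound
dist.init_process_group(backend="nccl", init_method="env://",
                        world_size=4, rank=int(os.environ.get("RANK", "0")))
```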
0
votes
0 answers

How to use pwrite to write files in parallel on Linux in C++?

I'm trying to create several threads to write some data chunks into one file in parallel. Some part of my code is below: void write_thread(float* data, size_t start, size_t end, size_t thread_idx) { auto function_start_time =…
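The key point is language-independent: pwrite writes at an absolute offset and does not touch the shared file position, so concurrent writers do not race on lseek+write. A sketch of that idea in Python via os.pwrite (the file name and chunk layout are invented for illustration, not taken from the question):

```python
import os
import threading

def write_chunk(fd, data: bytes, offset: int):
    # pwrite targets an absolute offset, so each thread can write its own
    # region of the file without coordinating a shared file position.
    os.pwrite(fd, data, offset)

chunk = b"x" * 1024
fd = os.open("out.bin", os.O_WRONLY | os.O_CREAT, 0o644)
threads = [threading.Thread(target=write_chunk, args=(fd, chunk, i * len(chunk)))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.close(fd)
```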
0
votes
0 answers

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 4055352) of

I wanted to use DistributedDataParallel to implement the model's single-machine multi-GPU training process, but encountered some problems during the process. The specific implementation code is: def _train_one_epoch(self,epoch): score_AM =…
chihiro
  • 5
  • 4
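The elastic "exitcode: 1" message only reports that one worker raised an exception; the real cause is printed above it in that rank's traceback. For reference, a minimal sketch of the single-machine multi-GPU DDP skeleton the code presumably follows, launched with `torchrun --nproc_per_node=2 train.py` (model, sizes, and file name are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun for each worker
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)            # each rank must own exactly one GPU

    model = torch.nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    x = torch.randn(32, 10, device=f"cuda:{local_rank}")
    loss = model(x).sum()
    loss.backward()                              # gradients are all-reduced across ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```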
0
votes
1 answer

DistributedDataParallel single-machine multi-card implementation with batch

I want to run my PyTorch model training code on multiple GPUs on a single server. The specific scenario is as follows: the training epochs=2000, the total number of training data episodes for each epoch =1000, and there are three GPUs. The…
chihiro
  • 5
  • 4
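The usual way to split each epoch's data across the GPUs under DDP is a DistributedSampler, which gives every rank a disjoint shard. A hedged sketch with placeholder dataset and batch sizes (the explicit num_replicas/rank arguments make it runnable standalone; inside a real DDP job they are picked up from the process group automatically):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
sampler = DistributedSampler(dataset, num_replicas=3, rank=0)  # this rank's shard
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(2000):
    sampler.set_epoch(epoch)        # reshuffle the shards each epoch
    for x, y in loader:
        pass                        # per-rank forward/backward goes here
```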
0
votes
0 answers

How to use Fully Sharded Data Parallel (FSDP) via the Seq2SeqTrainer class of Hugging Face?

I have 2 GTX 1080 Ti GPUs (11 GB RAM each) and I want to fine-tune the openai/whisper-small model, which is one of the Hugging Face Transformers models. I also want to use Fully Sharded Data Parallel (FSDP) via Seq2SeqTrainer, but I get an error. torch…
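In recent transformers releases, FSDP is enabled through the `fsdp` field of the training arguments rather than by wrapping the model manually; the run then has to be launched with torchrun so that each GPU gets its own process. A hedged sketch with placeholder paths and batch size (not the asker's actual configuration):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-fsdp",    # placeholder output path
    per_device_train_batch_size=8,
    fsdp="full_shard auto_wrap",          # shard params/grads/optimizer state across GPUs
)

# trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds, ...)
# launched with: torchrun --nproc_per_node=2 train.py
```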
0
votes
0 answers

Problem of GPU memory duplication across multiple GPUs when disabling data parallelization

I am working on a PyTorch project, and I want to disable data parallelization to ensure that each program runs on a single specified GPU, avoiding memory duplication. I have followed the standard steps of moving the model to the desired GPU device…
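One frequent cause of memory appearing on GPUs the program should not touch is that CUDA_VISIBLE_DEVICES is set after CUDA has already been initialized. A minimal sketch, assuming the goal is to pin each program to one physical GPU (the device index is a placeholder):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # hypothetical: expose only physical GPU 1,
                                           # set before torch touches CUDA

import torch

device = torch.device("cuda:0")            # index 0 now maps to the one visible GPU
model = torch.nn.Linear(10, 1).to(device)  # plain .to(device), no DataParallel
x = torch.randn(4, 10, device=device)
print(model(x).shape)
```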
0
votes
1 answer

Pytorch nn.DataParallel: RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

I am implementing the nn.DataParallel class to utilize multiple GPUs on a single machine. I have followed some Stack Overflow questions and answers but still get a simple error, and I have no idea why I am getting it. Followed Questions RuntimeError:…
Khawar Islam
  • 2,556
  • 2
  • 34
  • 56
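This particular mismatch (CUDA inputs, CPU weights) usually means the inputs were moved to the GPU but the model was not. A hedged sketch of the typical fix, with a placeholder model standing in for the asker's network:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                   # placeholder network
model = nn.DataParallel(model).cuda()      # weights now live on the GPUs

x = torch.randn(8, 10).cuda()              # inputs on the GPU as well
out = model(x)                             # input and weight types now match
```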
0
votes
0 answers

How to use torch.nn.DataParallel if I have more than one network working in tandem?

I have a model as such: netF = timm.create_model(...) #feature extractor netB = network.feat_bottlenect(...) #bottleneck layer netC = network.feat_classifier(...) #classifier layer output = netF(netB(netC(input))) I want to apply…
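One hedged option is to compose the three networks into a single nn.Sequential and wrap that once with DataParallel (wrapping each network separately also works). The layers below are simple placeholders for the timm feature extractor, bottleneck, and classifier in the question:

```python
import torch
import torch.nn as nn

netF = nn.Linear(32, 16)   # placeholder for the feature extractor
netB = nn.Linear(16, 8)    # placeholder for the bottleneck layer
netC = nn.Linear(8, 4)     # placeholder for the classifier layer

pipeline = nn.DataParallel(nn.Sequential(netF, netB, netC)).cuda()
out = pipeline(torch.randn(8, 32).cuda())  # the whole chain is split across GPUs
```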
0
votes
0 answers

Multi Node Training: How to use multiple GPUs on multiple machines in pytorch?

I am working on multiple machines; each machine has two GPUs, so overall I have 4 GPUs across the two machines. I am following the official PyTorch example to train on the ImageNet dataset. When I start the training…
Khawar Islam
  • 2,556
  • 2
  • 34
  • 56
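For reference, a hedged sketch of the 2-node x 2-GPU launch pattern with torchrun; the IP address, port, and script name are placeholders, and the rank-0 machine must be reachable from the other node on that port:

```python
# On node 0:  torchrun --nnodes=2 --nproc_per_node=2 --node_rank=0 \
#                      --master_addr=192.168.1.10 --master_port=29500 main.py
# On node 1:  torchrun --nnodes=2 --nproc_per_node=2 --node_rank=1 \
#                      --master_addr=192.168.1.10 --master_port=29500 main.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # reads the env vars torchrun sets
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
print(f"global rank {dist.get_rank()} of {dist.get_world_size()}")
dist.destroy_process_group()
```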
0
votes
0 answers

PyTorch multiple GPUs: AttributeError: 'list' object has no attribute 'to'

I have simply implemented the DataParallel technique to utilize multiple GPUs on a single machine. I am getting an error in the fit function https://github.com/mindee/doctr/blob/main/references/recognition/train_pytorch.py from fastprogress.fastprogress…
Khawar Islam
  • 2,556
  • 2
  • 34
  • 56
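The error itself just means the DataLoader yielded a Python list, and lists have no .to() method; each tensor inside the batch has to be moved individually. A hedged sketch with an invented batch structure (the real structure depends on the doctr collate function):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def batch_to_device(batch, device):
    # Recursively move tensors; leave non-tensor items (e.g. label strings) alone.
    if isinstance(batch, (list, tuple)):
        return [batch_to_device(b, device) for b in batch]
    return batch.to(device) if torch.is_tensor(batch) else batch

images, targets = batch_to_device([torch.randn(2, 3, 32, 32), ["a", "b"]], device)
```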
0
votes
0 answers

torch.multiprocessing.spawn.ProcessRaisedException: -- Process 0 terminated with the following error:

I am using multiple GPUs on the same system to train a network. I have followed all the steps mentioned in the PyTorch documentation. During validation, it gives an error regarding -- Process 0 Step 1: import torch.multiprocessing as mp import torch.distributed…
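ProcessRaisedException is only a wrapper: the exception that actually killed rank 0 is printed below that header. For context, a minimal runnable sketch of the mp.spawn pattern from the documentation, using the CPU-friendly gloo backend and placeholder address/port:

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # ... per-rank training / validation would go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)  # any exception in a worker
                                                      # surfaces as ProcessRaisedException
```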
0
votes
0 answers

terminate called after throwing an instance of 'std::runtime_error' what(): NCCL Error 1: unhandled cuda error

This error occurs when using DataParallel, but everything works when using only 1 GPU. May I ask why this problem occurs and how I can solve it? terminate called after throwing an instance of 'std::runtime_error' what(): NCCL Error 1: unhandled cuda…
CHF
  • 9
  • 1
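"NCCL Error 1: unhandled cuda error" hides the real CUDA failure; a common first step is to enable NCCL diagnostics, and a frequently reported workaround is disabling peer-to-peer transfers between GPUs. A hedged sketch (these environment variables must be set before any CUDA/NCCL work happens; whether P2P is actually the culprit depends on the machine):

```python
import os
os.environ["NCCL_DEBUG"] = "INFO"        # print the underlying CUDA/NCCL failure
os.environ["NCCL_P2P_DISABLE"] = "1"     # common workaround when GPU P2P is broken

import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(10, 2)).cuda()   # placeholder model
out = model(torch.randn(8, 10).cuda())
```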
0
votes
1 answer

Parameters can't be updated when using torch.nn.DataParallel to train on multiple GPUs

import torch import torch.nn as nn import os class Net(nn.Module): def __init__(self): super().__init__() self.h = -1 def forward(self, x): self.h =x os.environ['CUDA_VISIBLE_DEVICES'] = '0' if…
hescluke
  • 3
  • 3
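The underlying behaviour here is that nn.DataParallel replicates the module onto each GPU for every forward pass, so attributes assigned inside forward (like self.h = x) land on the throwaway replicas and never reach the original module. A hedged sketch of the usual workaround, returning the value instead of stashing it on self:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 3)

    def forward(self, x):
        h = self.linear(x)    # computed on the per-GPU replica
        return h              # anything you need back must be returned, not
                              # assigned to self (replica state is discarded)

net = nn.DataParallel(Net()).cuda()
h = net(torch.randn(4, 3).cuda())   # gathered back onto the default GPU
```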
0
votes
1 answer

Replacement of var.to(device) in case of nn.DataParallel() in pytorch

There is an existing question about this, but its answer is not relevant. This code will transfer the model to multiple GPUs, but how do I transfer the data to the GPUs? if torch.cuda.device_count() > 1: print("Let's use", torch.cuda.device_count(), "GPUs!") #…
Adnan Ali
  • 2,851
  • 5
  • 22
  • 39
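With nn.DataParallel nothing special replaces var.to(device): the inputs are still moved to one device (the first GPU), and the wrapper scatters them across the replicas itself. A minimal sketch, with a placeholder model in place of the asker's network:

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0")
model = nn.Linear(10, 2)                   # placeholder network
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to(device)

inputs = torch.randn(16, 10).to(device)    # same .to(device) as the single-GPU case
outputs = model(inputs)                    # DataParallel splits the batch internally
```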