I trained a Faster R-CNN to detect tools. I had already defined my model and everything worked. But to get cleaner code without global variables, I tried to write a class MyModel that automatically defines every object and trains the model. In this class I defined a dataset attribute, self.dataset = ToolDataset.
In this dataset class I defined my input (an image) and my output (a target, which is a dictionary with bboxes, labels, area, …).
Then I built a data loader (so I have a self.data_loader) and used the train_one_epoch function from the engine library. I pass this function my model (a Faster R-CNN), my data loader, and the device, which is cuda:0 (I printed it). The function iterates over my data loader, builds a list of images and a list of targets, and moves the values of both lists to the right device.
It then calls model(images, targets), and at this step I get the error saying that two devices were found (I pasted the error at the end of the message).
I get the error even though every tensor (my images and every value of my target dictionary) returns True for tensor.is_cuda. So I really don't understand why the error says there is also a CPU device. Below are my train method, my train_one_epoch function, and my images and targets variables.
train method:
def train(self, num_epoch = 10, gpu = True):
    if gpu :
        CUDA_LAUNCH_BLOCKING="1"
    #torch.set_default_tensor_type(torch.FloatTensor)
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda:0" if use_cuda else "cpu")
    model.to(device)
    if self.multi_object_detection == False :
        num_classes = 2 # ['Tool', 'background']
    else :
        print("need to set a multi object detection code")
    in_features = torch.tensor(model.roi_heads.box_predictor.cls_score.in_features, dtype = torch.int64).to(device)
    print("in_features = {}".format(in_features))
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    print( "model.roi_heads.box_predictor {}".format( model.roi_heads.box_predictor))
    model_parameters = filter(lambda p: p.requires_grad, model.parameters())
    #params = sum([np.prod(p.size()) for p in model_parameters])
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    gc.collect()
    num_epochs = 5
    FILE_model_dict_gpu = "model_state_dict__gpu_lab2_and_lab7_5epoch.pth"
    list_of_list_losses = []
    print("device = ", device)
    if (self.data_loader.dataset) == None :
        self.build_dataloader(device)
    for epoch in tqdm(range(num_epochs)):
        # Train for one epoch, printing every 10 iterations
        train_his_, list_losses, list_losses_dict = train_one_epoch(model, optimizer, self.data_loader, device, epoch, print_freq=10)
        list_of_list_losses.append(list_losses)
        # Compute losses over the validation set
        #val_his_ = validate_one_epoch(model, val_data_loader, device, print_freq=10)
        # Update the learning rate
        print("lr before update : ", lr_scheduler)
        lr_scheduler.step()
        print("lr after update : ", lr_scheduler)
        # Store loss values to plot learning curves afterwards.
        if epoch == 0:
            train_history = {k: [v] for k, v in train_his_.items()}
            #val_history = {k: [v] for k, v in val_his_.items()}
        else:
            for k, v in train_his_.items():
                train_history[k] += [v]
            # for k, v in val_his_.items():val_history[k] += [v]
        # The model could be saved inside the loop by adding a criterion: if the validation loss decreases
        # torch.save(model, save_path)
        torch.cuda.empty_cache()
        gc.collect()
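One thing that may be worth checking in the train method above: model.to(device) runs before model.roi_heads.box_predictor is replaced, and a newly constructed module starts on the CPU. The sketch below is only an illustration of that ordering with a tiny stand-in model; TinyDetector, param_devices, and the layer sizes are hypothetical names and values, not part of torchvision or of my code.

```python
import torch
from torch import nn


def param_devices(module: nn.Module) -> set:
    """Return the set of device names used by the module's parameters and buffers."""
    tensors = list(module.parameters()) + list(module.buffers())
    return {str(t.device) for t in tensors}


class TinyDetector(nn.Module):
    """Hypothetical stand-in for the detector: a body plus a replaceable head."""

    def __init__(self):
        super().__init__()
        self.body = nn.Linear(8, 4)
        self.head = nn.Linear(4, 3)

    def forward(self, x):
        return self.head(self.body(x))


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = TinyDetector()
model.head = nn.Linear(4, 2)  # 1. replace the head first (new modules start on the CPU)
model.to(device)              # 2. then move the whole model, new head included

# Every parameter now sits on the target device, so the forward pass is consistent.
out = model(torch.randn(5, 8, device=device))
print(param_devices(model), out.shape)
```

If the two numbered steps are swapped on a CUDA machine, the body ends up on cuda:0 while the new head stays on the CPU, which would match an error raised inside the head's Linear layer.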
train_one_epoch function (it prints some information that appears in the output at the end of the message):
def train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq):
    model.train()
    metric_logger = utilss.MetricLogger(delimiter=" ")
    metric_logger.add_meter('lr', utilss.SmoothedValue(window_size=1, fmt='{value:.6f}'))
    header = 'Epoch: [{}]'.format(epoch)
    list_losses = []
    list_losses_dict = []
    for i, values in tqdm(enumerate(metric_logger.log_every(data_loader, print_freq, header))):
        images, targets = values
        for image in images :
            print("before the to(device) operation, image.is_cuda = {}".format(image.is_cuda))
        # Move every image and every target tensor to the training device
        images = list(image.to(device, dtype=torch.float) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        #images = [image.cuda() for image in images]
        for image in images :
            print(image)
            print("after the to(device) operation, image.is_cuda = {}".format(image.is_cuda))
        for target in targets :
            for t, dict_value in target.items():
                print("after the to(device) operation, dict_value.is_cuda = {}".format(dict_value.is_cuda))
        print("images = {}".format(images))
        print("targets = {}".format(targets))
        # Feed the training samples to the model and compute the losses
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        # reduce losses over all GPUs for logging purposes
        loss_dict_reduced = utilss.reduce_dict(loss_dict)
        losses_reduced = sum(loss for loss in loss_dict_reduced.values())
        loss_value = losses_reduced.item()
        print("Loss is {}".format(loss_value))
        if not math.isfinite(loss_value):
            print("Loss is {}, stopping training".format(loss_value))
            print(loss_dict_reduced)
            sys.exit(1)
        list_losses.append(loss_value)
        # Reset the gradients stored on the parameters
        optimizer.zero_grad()
        # Backpropagate to compute the gradients
        losses.backward()
        # Update the parameters from the gradients
        optimizer.step()
Here is my output, with my images and targets followed by the error:
in_features = 1024
model.roi_heads.box_predictor FastRCNNPredictor(
(cls_score): Linear(in_features=1024, out_features=2, bias=True)
(bbox_pred): Linear(in_features=1024, out_features=8, bias=True)
)
device = cuda:0
100%|██████████| 515/515 [00:00<00:00, 112118.06it/s]
100%|██████████| 761/761 [00:00<00:00, 111005.96it/s]
0%| | 0/5 [00:00<?, ?it/s]
0it [00:00, ?it/s]
before the to(device) operation, image.is_cuda = True
tensor([[[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
...,
[0.0078, 0.0078, 0.0078, ..., 0.0118, 0.0118, 0.0118],
[0.0235, 0.0235, 0.0235, ..., 0.0235, 0.0235, 0.0235],
[0.0353, 0.0353, 0.0353, ..., 0.0314, 0.0314, 0.0314]],
[[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
...,
[0.0078, 0.0078, 0.0078, ..., 0.0039, 0.0039, 0.0039],
[0.0235, 0.0235, 0.0235, ..., 0.0157, 0.0157, 0.0157],
[0.0353, 0.0353, 0.0353, ..., 0.0235, 0.0235, 0.0235]],
[[0.0078, 0.0078, 0.0078, ..., 0.0118, 0.0118, 0.0118],
[0.0078, 0.0078, 0.0078, ..., 0.0118, 0.0118, 0.0118],
[0.0078, 0.0078, 0.0078, ..., 0.0118, 0.0118, 0.0118],
...,
[0.0078, 0.0078, 0.0078, ..., 0.0078, 0.0078, 0.0078],
[0.0235, 0.0235, 0.0235, ..., 0.0196, 0.0196, 0.0196],
[0.0353, 0.0353, 0.0353, ..., 0.0275, 0.0275, 0.0275]]],
device='cuda:0')
after the to(device) operation, image.is_cuda = True
after the to(device) operation, dict_value.is_cuda = True
after the to(device) operation, dict_value.is_cuda = True
after the to(device) operation, dict_value.is_cuda = True
after the to(device) operation, dict_value.is_cuda = True
after the to(device) operation, dict_value.is_cuda = True
images = [tensor([[[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
...,
[0.0078, 0.0078, 0.0078, ..., 0.0118, 0.0118, 0.0118],
[0.0235, 0.0235, 0.0235, ..., 0.0235, 0.0235, 0.0235],
[0.0353, 0.0353, 0.0353, ..., 0.0314, 0.0314, 0.0314]],
[[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
[0.0078, 0.0078, 0.0078, ..., 0.0000, 0.0000, 0.0000],
...,
[0.0078, 0.0078, 0.0078, ..., 0.0039, 0.0039, 0.0039],
[0.0235, 0.0235, 0.0235, ..., 0.0157, 0.0157, 0.0157],
[0.0353, 0.0353, 0.0353, ..., 0.0235, 0.0235, 0.0235]],
[[0.0078, 0.0078, 0.0078, ..., 0.0118, 0.0118, 0.0118],
[0.0078, 0.0078, 0.0078, ..., 0.0118, 0.0118, 0.0118],
[0.0078, 0.0078, 0.0078, ..., 0.0118, 0.0118, 0.0118],
...,
[0.0078, 0.0078, 0.0078, ..., 0.0078, 0.0078, 0.0078],
[0.0235, 0.0235, 0.0235, ..., 0.0196, 0.0196, 0.0196],
[0.0353, 0.0353, 0.0353, ..., 0.0275, 0.0275, 0.0275]]],
device='cuda:0')]
targets = [{'boxes': tensor([[1118.8964, 0.0000, 1368.9186, 399.3243],
[1043.0958, 111.4863, 1332.4319, 426.1295]], device='cuda:0',
dtype=torch.float64), 'labels': tensor([1, 1], device='cuda:0'), 'index': tensor([311], device='cuda:0'), 'area': tensor([99839.9404, 91037.6485], device='cuda:0', dtype=torch.float64), 'iscrowd': tensor([0], device='cuda:0')}]
/home/nathaneberrebi/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /opt/conda/conda-bld/pytorch_1623448278899/work/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
0it [00:02, ?it/s]
0%| | 0/5 [00:02<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-15-51a35da5b1fe> in <module>
----> 1 class_model.train()
<ipython-input-7-d44d099a7743> in train(self, num_epoch, gpu)
144
145 # Train for one epoch, printing every 10 iterations
--> 146 train_his_, list_losses, list_losses_dict = train_one_epoch(model, optimizer, self.data_loader, device, epoch, print_freq=10)
147 list_of_list_losses.append(list_losses)
148 # Compute losses over the validation set
<ipython-input-6-347c12a81a2f> in train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq)
519
520 # Feed the training samples to the model and compute the losses
--> 521 loss_dict = model(images, targets)
522 losses = sum(loss for loss in loss_dict.values())
523
~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []
~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/generalized_rcnn.py in forward(self, images, targets)
95 features = OrderedDict([('0', features)])
96 proposals, proposal_losses = self.rpn(images, features, targets)
---> 97 detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
98 detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
99
~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []
~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/roi_heads.py in forward(self, features, proposals, image_shapes, targets)
752 box_features = self.box_roi_pool(features, proposals, image_shapes)
753 box_features = self.box_head(box_features)
--> 754 class_logits, box_regression = self.box_predictor(box_features)
755
756 result: List[Dict[str, torch.Tensor]] = []
~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []
~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/faster_rcnn.py in forward(self, x)
280 assert list(x.shape[2:]) == [1, 1]
281 x = x.flatten(start_dim=1)
--> 282 scores = self.cls_score(x)
283 bbox_deltas = self.bbox_pred(x)
284
~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []
~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/linear.py in forward(self, input)
94
95 def forward(self, input: Tensor) -> Tensor:
---> 96 return F.linear(input, self.weight, self.bias)
97
98 def extra_repr(self) -> str:
~/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py in linear(input, weight, bias)
1845 if has_torch_function_variadic(input, weight):
1846 return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1847 return torch._C._nn.linear(input, weight, bias)
1848
1849
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument mat1 in method wrapper_addmm)
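The traceback ends inside self.cls_score(x), the Linear layer of the box predictor, so the mismatch may come from the model's own parameters rather than from the images or targets. A minimal sketch of a check for this is below; parameters_by_device is a hypothetical helper name, and the small nn.Sequential model only stands in for the detector to show the output format.

```python
from collections import defaultdict

import torch
from torch import nn


def parameters_by_device(model: nn.Module) -> dict:
    """Group parameter and buffer names by the device they live on."""
    groups = defaultdict(list)
    named = list(model.named_parameters()) + list(model.named_buffers())
    for name, tensor in named:
        groups[str(tensor.device)].append(name)
    return dict(groups)


# Hypothetical small model, used only to show what the report looks like.
demo = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Linear(4, 2))
report = parameters_by_device(demo)
for device_name, names in report.items():
    print(device_name, len(names), names[:3])
```

Calling parameters_by_device(model) just before model(images, targets) should show whether any roi_heads.box_predictor entries are listed under cpu while the rest of the model is under cuda:0.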
Thank you very much for your help; I have been stuck on this issue for a while. I also cannot torch.jit.trace my previous model (the one from before I tried to clean up my code with a class that builds every object automatically from a single train function) because of the same error, and I need to fix it to use this model in C++ code. Let me know if you need further information.
Here is my torch env:
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.8.0-59-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 3060 Laptop GPU
Nvidia driver version: 460.80
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] numpydoc==1.1.0
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0a0+33b2469
[pip3] torchvision==0.10.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] mkl 2021.2.0 h06a4308_296
[conda] mkl-service 2.4.0 py38h497a2fe_0 conda-forge
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.2 py38h1abd341_0 conda-forge
[conda] numpy 1.18.5 pypi_0 pypi
[conda] numpy-base 1.20.2 py38hfae3a4d_0
[conda] numpydoc 1.1.0 py_1 conda-forge
[conda] pytorch 1.9.0 py3.8_cuda11.1_cudnn8.0.5_0 pytorch
[conda] torch 1.9.0 pypi_0 pypi
[conda] torchaudio 0.9.0 py38 pytorch
[conda] torchvision 0.10.0 py38_cu111 pytorch