I implemented a model that combines an MLP, an RNN, and a CNN. With a batch size of 420, everything seems to work fine (i.e., I don't get any errors). However, as soon as I increase the batch size to 840, I receive the following error:

Traceback (most recent call last):
  File "train_cnn_rnn.py", line 152, in <module>
    loss.backward()
  File "/home/tbaumgae/.local/lib/python3.5/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
RuntimeError: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.

The forward pass seems to work fine. I checked whether all the variables are contiguous, and they are. My prediction and target for the loss calculation are contiguous as well, and so is the returned loss. Yet this error occurs when calling backward(). Any ideas why this would happen?
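For reference, here is roughly how I check contiguity (the shapes below are purely illustrative, not my actual data):

    import torch

    x = torch.randn(2, 3, 4, 4)   # freshly allocated tensors are contiguous
    y = x.transpose(0, 1)         # views such as transpose are not
    print(x.is_contiguous())      # True
    print(y.is_contiguous())      # False
    y = y.contiguous()            # copies the data into contiguous memory
    print(y.is_contiguous())      # True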

CUDA Version 8.0.61

Python 3.5.2

Comment Summary:

  • There are 210 images in one sequence; therefore, my batch size is in steps of 210. Each image has a shape of [3, 250, 250] (see the sketch after this list).
  • I'm using the PyTorch backward, haven't implemented any backward method myself.
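For context, a minimal sketch of the input layout described above (tensor names are placeholders, and zeros stand in for the real, padded images):

    import torch

    # One sequence = 210 frames, each of shape [3, 250, 250].
    seq = torch.zeros(210, 3, 250, 250)
    # Two sequences give the failing batch size of 840.
    batch = torch.cat([seq, seq], dim=0)
    print(batch.shape)  # torch.Size([840, 3, 250, 250])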
  • What size is your data? Is it possible that when you increase the batch size you're padding the data with invalid values? Can you share your code & data? – finbarr Jul 24 '17 at 21:41
  • I need to process a sequence of 210 images, three color channels, 250 x 250 per data point (that's why my batch size is in steps of 210). I'm padding with 0's, which also applies to small batch sizes, so I don't think that's the problem. – timbmg Jul 25 '17 at 07:41
  • Looks really strange. Could the problem be that you're combining inputs (i.e. gradients) for some computation during backward that were computed on different devices, and the problem only occurs with the larger batch_size because only then do these inputs get split over different devices? – mbpaulus Jul 25 '17 at 08:59
  • I'm using the PyTorch backward; I haven't implemented any backward method myself. As I only have one GPU, I believe it would give a different error if tensors were split between the GPU and CPU. So that should be fine too. – timbmg Jul 25 '17 at 09:03
  • 1
    What is the version of your cuDNN installation? You might want to consider upgrading. – entrophy Jul 26 '17 at 19:25
  • I do not have cuDNN installed. Do I even need it? I'm not writing CUDA code per se, just running it through PyTorch. – timbmg Jul 27 '17 at 09:07
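For anyone checking the same thing: the PyTorch binaries bundle their own cuDNN, so it can be in use even without a system-wide installation. A minimal sketch for inspecting it, and for disabling it as a test, via the torch.backends.cudnn module:

    import torch

    print(torch.backends.cudnn.version())  # bundled cuDNN version, or None
    print(torch.backends.cudnn.enabled)    # True by default

    # Disabling cuDNN falls back to the native kernels; if backward()
    # then succeeds, the failure is specific to a cuDNN code path.
    torch.backends.cudnn.enabled = False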

0 Answers