PyTorch gradients have different shape for CUDA and CPU

Question

I’m dealing with a strange issue where the gradients captured after the backward pass have different shapes depending on whether CUDA or CPU is used. The model is relatively simple:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool1 = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.pool2 = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        self.relu1 = nn.ReLU()
        self.relu2 = nn.ReLU()
        self.relu3 = nn.ReLU()
        self.relu4 = nn.ReLU()

    def forward(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = self.relu3(self.fc1(x))
        x = self.relu4(self.fc2(x))
        x = self.fc3(x)
        return x

The input tensor has shape (1, 3, 32, 32), and the relevant section of code is as follows, with the method generate_gradients being of particular importance:

class VanillaBackprop():
    """
        Produces gradients generated with vanilla back propagation from the image
    """
    def __init__(self, model):
        self.model = model
        self.gradients = None
        # Put model in evaluation mode
        self.model.eval()
        # Hook the first layer to get the gradient
        self.hook_layers()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)


    def hook_layers(self):
        def hook_function(module, grad_in, grad_out):
            self.gradients = grad_in[0]

        # Register hook to the first layer
        try:
            first_layer = list(self.model.features._modules.items())[0][1]
        except:
            first_layer = list(self.model._modules.items())[0][1]
        first_layer.register_backward_hook(hook_function)

    def generate_gradients(self, input_image, target_class):
        # Forward
        model_output = self.model(input_image.to(self.device))
        # Zero grads
        self.model.zero_grad()
        # Target for backprop
        one_hot_output = torch.FloatTensor(1, model_output.size()[-1]).zero_()
        one_hot_output[0][target_class] = 1
        # Backward pass
        model_output.backward(gradient=one_hot_output.to(self.device))
        # Convert Pytorch variable to numpy array
        gradients_as_arr = self.gradients.data.cpu().numpy()[0]
        return gradients_as_arr

When on CPU, self.gradients has shape (1, 3, 32, 32), while on CUDA it has shape (1, 6, 28, 28). How is that possible, and how do I fix this? Any help is much appreciated.

The try-except in hook_layers looks suspicious. What exception is it supposed to handle? — BlackBear, Jul 20 '20 at 15:13
The try-except is there to handle non-torchvision models. Most torchvision models consist of `model.features` (conv + activation + pooling blocks), and `model.classifier` (the fully connected layers at the end). You'll notice that a basic convnet (such as the one I created or any nn.Sequential model) lacks these two subcomponents. Essentially, I am iterating over the layers, and for a torchvision model you need to iterate over `model.features._modules`, while for a non-torchvision one you'll need `self.model._modules`. It's not foolproof, but it's good enough for my purposes here. — Semih Cantürk, Jul 20 '20 at 15:31
Okay. `(1, 6, 28, 28)` looks like the output of the first convolutional layer. Can you check that the first layer is actually the first layer in both cases? — BlackBear, Jul 20 '20 at 15:41
Indeed, the first layer is actually the first in both cases, and iterating through the forward method results in the same shapes after each layer for both. For example, `x = self.pool1(self.relu1(self.conv1(x)))` outputs `x` with shape `(1, 6, 14, 14)` both on CUDA and CPU, which is expected behavior. — Semih Cantürk, Jul 20 '20 at 16:04

score 1 · Accepted Answer · answered Aug 05 '20 at 12:09

It looks like the issue stems from the register_backward_hook() function. As pointed out in the PyTorch forums:

"You might want to double check the register_backward_hook() doc. But it is known to be kind of broken at the moment and can have this behavior.

"I would recommend you use autograd.grad() for this though. That will make it simpler than backward+access to the .grad field."

However, rather than switching to autograd.grad() as suggested, I opted to replace the module-level register_backward_hook() with a tensor-level register_hook() on the input, which seems to work as well:

import torch

class VanillaBackprop():
    """
        Produces gradients generated with vanilla back propagation from the image
    """
    def __init__(self, model):
        self.model = model
        self.gradients = None
        # Put model in evaluation mode
        self.model.eval()
        # No layer hook here: the hook is registered on the input tensor in generate_gradients()

    def hook_input(self, input_tensor):
        def hook_function(grad_in):
            self.gradients = grad_in
        input_tensor.register_hook(hook_function)

    def generate_gradients(self, input_image, target_class):
        # Move the input to the model's device and make it track gradients;
        # register_hook() raises an error on tensors that do not require grad
        device = next(self.model.parameters()).device
        input_image = input_image.detach().to(device).requires_grad_(True)
        # Register input hook
        self.hook_input(input_image)
        # Forward
        model_output = self.model(input_image)
        # Zero grads
        self.model.zero_grad()
        # Target for backprop: one-hot vector created directly on the model's device
        one_hot_output = torch.zeros(1, model_output.size()[-1], device=device)
        one_hot_output[0][target_class] = 1
        # Backward pass
        model_output.backward(gradient=one_hot_output)
        # Convert PyTorch tensor to NumPy array
        # [0] drops the batch dimension, e.g. (1, 3, 32, 32) -> (3, 32, 32)
        gradients_as_arr = self.gradients.detach().cpu().numpy()[0]
        return gradients_as_arr
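
For completeness, the autograd.grad() route suggested in the forum reply might look roughly like the following. This is a minimal sketch rather than the code I ended up using: the function name input_gradients and the small nn.Sequential stand-in model are made up for illustration, and it assumes a batch of one image. Selecting output[0, target_class] and differentiating it should give the same gradient as calling backward() with a one-hot vector.

```python
import torch
import torch.nn as nn


def input_gradients(model, input_image, target_class):
    """Return d(output[target_class]) / d(input) as a NumPy array of shape (C, H, W).

    Hypothetical helper for illustration; assumes input_image has batch size 1.
    """
    model.eval()
    device = next(model.parameters()).device
    # Work on a detached copy so that x is a leaf tensor that requires gradients.
    x = input_image.detach().to(device).requires_grad_(True)
    output = model(x)
    # Score of the target class for the single image in the batch (a scalar).
    score = output[0, target_class]
    # autograd.grad returns a tuple with one gradient per requested input.
    (grad,) = torch.autograd.grad(score, x)
    # Drop the batch dimension and move the result to the CPU.
    return grad[0].cpu().numpy()


# Stand-in model (an assumption, not the Net from the question).
model = nn.Sequential(
    nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Flatten(), nn.Linear(6 * 14 * 14, 10),
)
grads = input_gradients(model, torch.randn(1, 3, 32, 32), target_class=3)
print(grads.shape)  # (3, 32, 32)
```

Because no hook is involved, the result always has the shape of the input, regardless of the device the model lives on.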
