
Freezing weights in PyTorch when using a param_groups setting

So if one wants to freeze weights during training:

# freeze every parameter of a given submodule (`child`), e.g. one entry of model.children()
for param in child.parameters():
    param.requires_grad = False

the optimizer also has to be updated so that it does not include the non-gradient weights:

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=opt.lr, amsgrad=True)

If one wants to use a different weight_decay / learning rate for biases and weights (this also allows for differing learning rates between groups):

param_groups = [{'params': model.module.bias_parameters(), 'weight_decay': args.bias_decay},
                {'params': model.module.weight_parameters(), 'weight_decay': args.weight_decay}]

param_groups, a list of dicts, is defined and passed into the optimizer as follows:

optimizer = torch.optim.Adam(param_groups, args.lr,
                             betas=(args.momentum, args.beta))

How can this be combined with freezing individual weights? By running filter over each dict in the list, or is there a way of adding tensors to the optimizer separately?
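For reference, the per-group filtering I have in mind would look roughly like this (a sketch only, reusing the custom bias_parameters() / weight_parameters() helpers and the args values from above):

param_groups = [{'params': [p for p in model.module.bias_parameters() if p.requires_grad],
                 'weight_decay': args.bias_decay},
                {'params': [p for p in model.module.weight_parameters() if p.requires_grad],
                 'weight_decay': args.weight_decay}]

optimizer = torch.optim.Adam(param_groups, args.lr,
                             betas=(args.momentum, args.beta))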

Benedict K.

1 Answer


Actually I think you don't have to update the optimizer. The Parameters handed over to the optimizer are just references.

So when you change the requires_grad flag it will immediately be updated.
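A quick way to see this (a minimal sketch of my own, not part of the example further below):

import torch.nn as nn
import torch.optim as optim

layer = nn.Linear(5, 1)
opt = optim.Adam(layer.parameters())

# the optimizer stores references to the very same Parameter objects
print(opt.param_groups[0]['params'][0] is layer.weight)  # True

layer.weight.requires_grad = False
# the flag change is immediately visible through the optimizer's reference
print(opt.param_groups[0]['params'][0].requires_grad)    # False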

But even if that were for some reason not the case: as soon as you set the requires_grad flag to False, you cannot calculate any new gradients for this weight (see the part at the bottom about None and zero gradients), so the gradient won't change anymore, and if you use optimizer.zero_grad() it will just stay zero.

So if there is no gradient, there is also no need to exclude these parameters from the optimizer: without a gradient the optimizer will simply do nothing, no matter what learning rate you use.
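To see the learning-rate point in isolation, here is a minimal check (my own sketch, separate from the example below): even with plain SGD and an absurdly large learning rate, a frozen weight is not touched by optimizer.step().

import torch
import torch.nn as nn
import torch.optim as optim

layer = nn.Linear(3, 1)
layer.weight.requires_grad = False             # freeze the weight, keep the bias trainable

opt = optim.SGD(layer.parameters(), lr=100.0)  # huge learning rate on purpose

before = layer.weight.detach().clone()
loss = layer(torch.rand(3)).squeeze()
opt.zero_grad()
loss.backward()                                # layer.weight.grad stays None
opt.step()                                     # parameters without a gradient are skipped

print(torch.equal(before, layer.weight))       # True: the frozen weight did not move
print(layer.weight.grad)                       # None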

Here is a small example to show this behaviour over a few iterations:

import torch
import torch.nn as nn
import torch.optim as optim

n_dim = 5

p1 = nn.Linear(n_dim, 1)
p2 = nn.Linear(n_dim, 1)

# both layers stay registered in the optimizer the whole time
optimizer = optim.Adam(list(p1.parameters()) + list(p2.parameters()))
# freeze p2's weight before training starts
p2.weight.requires_grad = False
for i in range(4):
    # dummy loss that depends on both layers
    dummy_loss = (p1(torch.rand(n_dim)) + p2(torch.rand(n_dim))).squeeze()
    optimizer.zero_grad()
    dummy_loss.backward()
    optimizer.step()
    print('p1: requires_grad =', p1.weight.requires_grad, ', gradient:', p1.weight.grad)
    print('p2: requires_grad =', p2.weight.requires_grad, ', gradient:', p2.weight.grad)
    print()

    # after the second iteration, swap which weight is frozen
    if i == 1:
        p1.weight.requires_grad = False
        p2.weight.requires_grad = True

Output:

p1: requires_grad = True , gradient: tensor([[0.8522, 0.0020, 0.1092, 0.8167, 0.2144]])
p2: requires_grad = False , gradient: None

p1: requires_grad = True , gradient: tensor([[0.7635, 0.0652, 0.0902, 0.8549, 0.6273]])
p2: requires_grad = False , gradient: None

p1: requires_grad = False , gradient: tensor([[0., 0., 0., 0., 0.]])
p2: requires_grad = True , gradient: tensor([[0.1343, 0.1323, 0.9590, 0.9937, 0.2270]])

p1: requires_grad = False , gradient: tensor([[0., 0., 0., 0., 0.]])
p2: requires_grad = True , gradient: tensor([[0.0100, 0.0123, 0.8054, 0.9976, 0.6397]])

Here you can see that no gradients are calculated for frozen weights. You may have noticed that the gradient for p2 is None at the beginning, while for p1, after deactivating gradients, it becomes tensor([[0., 0., 0., 0., 0.]]) instead of None.

This is the case because p1.weight.grad is just a variable which is modified by backward() and optimizer.zero_grad().

So at the beginning p1.weight.grad is just initialized to None; once gradients have been written or accumulated to this variable, they won't be cleared automatically. But because optimizer.zero_grad() is called they are set to zero, and they stay like that, since backward() can no longer compute new gradients with requires_grad=False.

You can also change the code in the if-statement to:

if i == 1:
    p1.weight.requires_grad = False
    p1.weight.grad = None
    p2.weight.requires_grad = True

Once reset to None, the gradients are left untouched and stay None:

p1: requires_grad = True , gradient: tensor([[0.2375, 0.7528, 0.1501, 0.3516, 0.3470]])
p2: requires_grad = False , gradient: None

p1: requires_grad = True , gradient: tensor([[0.5181, 0.5178, 0.6590, 0.6950, 0.2743]])
p2: requires_grad = False , gradient: None

p1: requires_grad = False , gradient: None
p2: requires_grad = True , gradient: tensor([[0.4797, 0.7203, 0.2284, 0.9045, 0.6671]])

p1: requires_grad = False , gradient: None
p2: requires_grad = True , gradient: tensor([[0.8344, 0.1245, 0.0295, 0.2968, 0.8816]])

I hope this makes sense to you!

MBT
  • Makes sense, although is there additional overhead by having the tensors in the computational graph yet not optimizing them? – Benedict K. Nov 06 '18 at 12:20
  • @BenedictK. They are not added to the graph if `requires_grad` is `False`. This flag actually adds them to the graph. Meaning, for tensors with `requires_grad=True`, autograd keeps track of what calculations are done etc., and this information is saved in buffers. But with `requires_grad=False` this information won't be added to the graph (see the small sketch after these comments). Quote: *"it’s enough to switch the `requires_grad` flags in the frozen base, and no intermediate buffers will be saved"* - That is what this flag is actually for; you can take a look here: https://pytorch.org/docs/stable/notes/autograd.html#requires-grad – MBT Nov 06 '18 at 12:56
  • This sentence from the [above linked site of the PyTorch documentation](https://pytorch.org/docs/stable/notes/autograd.html#autograd-mechanics) states it even more clearly: *"Every Tensor has a flag: `requires_grad` that allows for fine grained exclusion of subgraphs from gradient computation and can increase efficiency."* – MBT Nov 06 '18 at 13:03
  • One more thing that came to my mind which might explain it better: the *optimizer* itself actually has nothing to do with the *graph*. It just uses the already computed *graph* to alter the weights. So whatever you put in the *optimizer* will not change the *graph*. But you can control the *graph* by setting the `requires_grad` flag accordingly to include or exclude certain calculations from the graph. – MBT Nov 06 '18 at 13:13
  • @blue-phoenox So if an intermediate entity was set with `requires_grad=False`, it would calculate a temporary grad value without actually storing it. That would explain how preceding layers manage to get a `grad` value. – Rakshit Kothari Sep 14 '19 at 14:48
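To make the graph-tracking point from the comments concrete, here is a small illustration (my own, not from the thread): operations on tensors with requires_grad=False produce outputs with no grad_fn, i.e. nothing is recorded for the backward pass and no intermediate buffers are kept.

import torch

frozen = torch.rand(3, requires_grad=False)
trainable = torch.rand(3, requires_grad=True)

print((frozen * 2).grad_fn)     # None -> not part of the autograd graph
print((trainable * 2).grad_fn)  # <MulBackward0 ...> -> tracked for backward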