Training models interactively in PyTorch

Question

I need to train two models in parallel. Each model has a different activation function with trainable parameters. I want to train them so that the activation parameter of model one (e.g., alpha1) is separated from the activation parameter of model two (e.g., alpha2) by a gap of more than 2, i.e., |alpha1 - alpha2| > 2. How can I include this constraint in the loss function for training?

Szymon Maszke · Answer 1 · 2020-05-21T16:40:51.433

Example module definition

I will use torch.nn.PReLU as the parametric activation you mention. get_weight is added for convenience.

import torch


class Module(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.input = torch.nn.Linear(in_features, 2 * in_features)
        self.activation = torch.nn.PReLU()
        self.output = torch.nn.Linear(2 * in_features, out_features)

    def get_weight(self):
        return self.activation.weight

    def forward(self, inputs):
        return self.output(self.activation(self.input(inputs)))

Modules and setup

Here I'm using one optimizer to optimize the parameters of both modules. criterion can be mean squared error, cross entropy, or whatever else you need.

import itertools

module1 = Module(20, 1)
module2 = Module(20, 1)

# One optimizer updates the parameters of both modules.
optimizer = torch.optim.Adam(
    itertools.chain(module1.parameters(), module2.parameters())
)
criterion = ...

Training

Here is a single step; wrap it in a for-loop over your data as is usually done. Hopefully it's enough for you to get the idea:

inputs = ...
targets = ...

# Clear gradients left over from the previous step.
optimizer.zero_grad()

output1 = module1(inputs)
output2 = module2(inputs)

loss1 = criterion(output1, targets)
loss2 = criterion(output2, targets)

total_loss = loss1 + loss2
# Penalize the pair while the gap between the activation weights is below 2.
total_loss += torch.nn.functional.relu(
    2 - torch.abs(module1.get_weight() - module2.get_weight()).sum()
)
total_loss.backward()

optimizer.step()

This line is what you are after in this case:

total_loss += torch.nn.functional.relu(
    2 - torch.abs(module1.get_weight() - module2.get_weight()).sum()
)

relu is used so the network can't reap unlimited benefit solely by pushing the weights apart. Without it, the loss would keep decreasing, and eventually become negative, as the difference between the weights grows. In this case a bigger difference is better, but only until the gap reaches 2; beyond that it makes no difference.
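To make the behaviour of this term concrete, here is a minimal sketch that evaluates it for a few gaps. The helper name gap_penalty and its margin argument are mine, not part of the answer, and plain tensors stand in for the PReLU weights.

```python
import torch


def gap_penalty(alpha1, alpha2, margin=2.0):
    """Hinge penalty: positive while sum(|alpha1 - alpha2|) < margin, else zero."""
    gap = torch.abs(alpha1 - alpha2).sum()
    return torch.nn.functional.relu(margin - gap)


alpha2 = torch.tensor([0.0])
results = {}
for value in (0.5, 1.9, 2.0, 3.0):
    # alpha1 plays the role of a trainable activation weight.
    alpha1 = torch.tensor([value], requires_grad=True)
    penalty = gap_penalty(alpha1, alpha2)
    penalty.backward()
    results[value] = (penalty.item(), alpha1.grad.item())

for value, (penalty_value, grad) in results.items():
    print(f"gap={value} penalty={penalty_value:.2f} grad={grad}")
```

Inside the margin the penalty shrinks linearly as the gap grows while its gradient keeps a constant size of 1. Once the gap reaches the margin, both the penalty and its gradient are zero, so nothing pushes the weights further apart.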

If the gap has to strictly exceed 2, you may have to raise the 2 in the loss to 2.1 or so, because the incentive to widen the gap becomes small once it is close to 2.0.
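Putting the pieces together, the following is a self-contained sketch of the whole procedure on synthetic data. Several details are my assumptions rather than part of the answer: the random data, MSELoss as the criterion, the learning rate, the step count, the margin of 2.1, and the activation_init argument. The last one matters because both PReLU layers start at 0.25 by default, so the initial gap is exactly zero, and the gradient PyTorch assigns to torch.abs at zero is zero; starting the two activations at different values lets the penalty push them apart from the first step.

```python
import itertools

import torch


class Module(torch.nn.Module):
    def __init__(self, in_features, out_features, activation_init=0.25):
        super().__init__()
        self.input = torch.nn.Linear(in_features, 2 * in_features)
        # activation_init is an illustrative addition, passed to PReLU's init.
        self.activation = torch.nn.PReLU(init=activation_init)
        self.output = torch.nn.Linear(2 * in_features, out_features)

    def get_weight(self):
        return self.activation.weight

    def forward(self, inputs):
        return self.output(self.activation(self.input(inputs)))


torch.manual_seed(0)
module1 = Module(20, 1, activation_init=0.1)
module2 = Module(20, 1, activation_init=0.4)

optimizer = torch.optim.Adam(
    itertools.chain(module1.parameters(), module2.parameters()), lr=1e-2
)
criterion = torch.nn.MSELoss()
margin = 2.1  # slightly above the required gap of 2

# Synthetic regression data, purely illustrative.
inputs = torch.randn(64, 20)
targets = torch.randn(64, 1)

initial_gap = torch.abs(module1.get_weight() - module2.get_weight()).sum().item()

for step in range(500):
    optimizer.zero_grad()
    task_loss = criterion(module1(inputs), targets) + criterion(module2(inputs), targets)
    gap = torch.abs(module1.get_weight() - module2.get_weight()).sum()
    total_loss = task_loss + torch.nn.functional.relu(margin - gap)
    total_loss.backward()
    optimizer.step()

final_gap = torch.abs(module1.get_weight() - module2.get_weight()).sum().item()
print(f"gap before training: {initial_gap:.3f}, after training: {final_gap:.3f}")
```

A single optimizer over both parameter sets is kept here, as in the answer, so the penalty moves both activation weights in every step.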

Edit

Without an explicitly given threshold it might be hard, but maybe something like this would work:

# w1 and w2 are the activation weights of the two modules.
w1, w2 = module1.get_weight(), module2.get_weight()
total_loss = (
    (torch.abs(w1) + torch.abs(w2)).sum()
    + (1 / torch.abs(w1) + 1 / torch.abs(w2)).sum()
    - torch.abs(w1 - w2).sum()
)

It's kinda hackish for the network, but might be worth a try (if you apply additional L2 regularization).

In essence, this loss has its optimum at (-inf, +inf) pairs of weights in corresponding positions, and it will never be smaller than zero.

For these weights:

weights_a = torch.tensor([-1000.0, 1000, -1000, 1000, -1000])
weights_b = torch.tensor([1000.0, -1000, 1000, -1000, 1000])

The value of each part will be:

(torch.abs(weights_a) + torch.abs(weights_b)).sum()  # 10000
(1 / torch.abs(weights_a) + 1 / torch.abs(weights_b)).sum()  # 0.0100
torch.abs(weights_a - weights_b).sum()  # 10000
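The arithmetic above, and the claim that the loss never drops below zero, can be checked with a short sketch. The function name spread_loss is hypothetical, and float64 is my choice so that adding the small 0.01 term to 10000 does not pick up float32 rounding error.

```python
import torch


def spread_loss(w_a, w_b):
    """Threshold-free loss from the edit: magnitudes + inverse magnitudes - spread."""
    magnitude = (torch.abs(w_a) + torch.abs(w_b)).sum()
    inverse = (1 / torch.abs(w_a) + 1 / torch.abs(w_b)).sum()
    spread = torch.abs(w_a - w_b).sum()
    return magnitude + inverse - spread


weights_a = torch.tensor([-1000.0, 1000, -1000, 1000, -1000], dtype=torch.float64)
weights_b = -weights_a

# Opposite signs in every position: magnitudes and spread cancel.
opposite = spread_loss(weights_a, weights_b).item()
# Same signs in every position: the spread term is zero, nothing cancels.
same = spread_loss(weights_a, weights_a).item()

# Random weights: the loss stays non-negative because |a| + |b| >= |a - b|.
torch.manual_seed(0)
random_a = torch.randn(1000, dtype=torch.float64)
random_b = torch.randn(1000, dtype=torch.float64)
random_value = spread_loss(random_a, random_b).item()

print(opposite, same, random_value)
```

With opposite signs only the small inverse term survives, while identical large weights leave the full magnitude term in place, which is the heavy penalty for a wrong sign.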

In this case the network can reap easy benefits just by making the weights larger with opposite signs in both modules, disregarding what you actually want to optimize. A large L2 penalty on the weights of both modules might help (I think the optimum value would be 1/-1 if L2's alpha is equal to 1), and I suspect the network might be highly unstable.

With this loss function, if the network gets the sign of a large weight wrong, it will be heavily penalized.

In this case you would be left with the L2 alpha parameter to tune to make it work, which is not as strict as a fixed margin but still requires a hyperparameter choice.
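As a rough check of the 1/-1 guess made earlier, the sketch below scans a single pair of opposite-sign weights (a, -a) and looks for the minimum of the loss plus an L2 term. It assumes the L2 term is written as (alpha / 2) * (a**2 + b**2), the form whose gradient matches what weight_decay adds in optimizers such as torch.optim.SGD and torch.optim.Adam; the name l2_alpha and the grid are illustrative, and the task loss is ignored.

```python
import torch

l2_alpha = 1.0  # hypothetical name for the L2 strength

# One pair of weights with opposite signs: a > 0 and b = -a.
a = torch.linspace(0.1, 5.0, 4901, dtype=torch.float64)
b = -a

# Threshold-free loss from the edit, evaluated elementwise on the grid.
spread = (
    torch.abs(a) + torch.abs(b)
    + 1 / torch.abs(a) + 1 / torch.abs(b)
    - torch.abs(a - b)
)
# L2 term in the (alpha / 2) * ||w||^2 convention.
l2 = 0.5 * l2_alpha * (a ** 2 + b ** 2)
total = spread + l2

best = a[torch.argmin(total)].item()
print(f"|a| that minimises the loss on the grid: {best:.3f}")
```

Under this convention the minimum lands at about 1, in line with the guess. If the penalty is written as alpha * (a**2 + b**2) without the factor of one half, the same calculation gives about 0.79, so the exact location depends on how the L2 term is defined.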

Thanks for your answer. It is what I was after. I wonder if it is possible to leave the margin (i.e., 2) unknown, so that the network trains to increase the margin between the weights as much as it can. — Kevin, May 21 '20 at 14:47
@Kevin Tried to do something like you want (remember about `L2`), but you are still left with a hyperparameter. The margin isn't a problem per se (without `L2` it would go towards the largest possible margin); the problem is that the network can create pairs of `+inf, -inf` weights and drive this whole loss towards zero (yet not learn anything). If you want to optimize something, you need a point where the minimum exists. In this case it might be considered "softer" due to L2, but it's not a giant change. In your case you want neither `0.0` weights nor `+inf/-inf`, so there's always a tradeoff. — Szymon Maszke, May 21 '20 at 16:44
@Kevin also notice I've added `.sum()` so there is a single value added to loss in both possible solutions. — Szymon Maszke, May 21 '20 at 16:46
Thanks for your clarification. If, in your training loop, I use a for loop to update one model at a time instead of updating both models at once, would the two types of training be equivalent? — Kevin, May 22 '20 at 23:17
You could do either but it wouldn't be equivalent as only one set of weights would be updated at a time. — Szymon Maszke, May 23 '20 at 08:33
Thanks for your answer. In the function module1.get_weight() should I detach the weights before returning? I realized without detaching the network isn't learning. What's the exact effect? — Kevin, Jun 16 '20 at 03:04
