I need to train two models in parallel. Each model has a different activation function with trainable parameters. I want to train model one and model two so that the activation-function parameters of model one (e.g., alpha1) are separated from those of model two (e.g., alpha2) by a gap of more than 2; i.e., |alpha1 - alpha2| > 2. How can I include this constraint in the loss function for training?
1 Answer
Example module definition
I will use torch.nn.PReLU as the parametric activation you mention. The get_weight method is added for convenience.
import torch


class Module(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.input = torch.nn.Linear(in_features, 2 * in_features)
        # PReLU holds the trainable activation parameter (a single slope by default)
        self.activation = torch.nn.PReLU()
        self.output = torch.nn.Linear(2 * in_features, out_features)

    def get_weight(self):
        # Return the parameter itself so the penalty can backpropagate into it
        return self.activation.weight

    def forward(self, inputs):
        return self.output(self.activation(self.input(inputs)))
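For orientation, here is a small sketch of what get_weight returns. By default torch.nn.PReLU holds one learnable slope, initialised to 0.25, that scales negative inputs. The variable names (act, per_channel) are illustrative only, and num_parameters=40 is an assumed value chosen to match the hidden width of Module(20, 1).

```python
import torch

# Default PReLU: one learnable slope shared by all channels, initialised to 0.25.
act = torch.nn.PReLU()
weight_shape = tuple(act.weight.shape)  # (1,)

# Negative inputs are scaled by the slope; positive inputs pass through unchanged.
x = torch.tensor([-2.0, 3.0])
y = act(x)  # values: -0.5 and 3.0

# Assumed variant: one slope per hidden unit. The penalty then sees a vector,
# which is why a reduction such as .sum() is needed to get a scalar loss term.
per_channel = torch.nn.PReLU(num_parameters=40)
per_channel_shape = tuple(per_channel.weight.shape)  # (40,)

print(weight_shape, y.detach().tolist(), per_channel_shape)
```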
Modules and setup
Here I'm using a single optimizer for the parameters of both modules. criterion can be mean squared error, cross entropy, or whatever else you need.
import itertools

module1 = Module(20, 1)
module2 = Module(20, 1)

# One optimizer over the parameters of both modules
optimizer = torch.optim.Adam(
    itertools.chain(module1.parameters(), module2.parameters())
)
criterion = ...
Training
Here is a single step; wrap it in a for-loop over your data as usual. Hopefully it's enough to get the idea:
inputs = ...
targets = ...

optimizer.zero_grad()  # clear gradients left over from the previous step

output1 = module1(inputs)
output2 = module2(inputs)

loss1 = criterion(output1, targets)
loss2 = criterion(output2, targets)

total_loss = loss1 + loss2
# Penalty is positive while the gap is below 2 and zero otherwise
total_loss += torch.nn.functional.relu(
    2 - torch.abs(module1.get_weight() - module2.get_weight()).sum()
)
total_loss.backward()
optimizer.step()
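Packed into a loop, the step above might look like the following sketch. Everything data-related is an assumption made for illustration: random inputs and targets, mean squared error as criterion, a learning rate of 0.01, and two small torch.nn.Sequential stand-ins (built by the hypothetical helper make_model) in place of Module. The two PReLU slopes are initialised to different values (0.1 and 0.4, also assumed), because the gradient of torch.abs at exactly zero is zero, so identical slopes would get no push from the penalty until the task loss separates them.

```python
import itertools
import torch

torch.manual_seed(0)


def make_model(init_slope):
    # Hypothetical stand-in for Module: Linear -> PReLU -> Linear.
    return torch.nn.Sequential(
        torch.nn.Linear(20, 40),
        torch.nn.PReLU(init=init_slope),
        torch.nn.Linear(40, 1),
    )


model1, model2 = make_model(0.1), make_model(0.4)
optimizer = torch.optim.Adam(
    itertools.chain(model1.parameters(), model2.parameters()), lr=0.01
)
criterion = torch.nn.MSELoss()

# Synthetic data, for illustration only.
inputs = torch.randn(64, 20)
targets = torch.randn(64, 1)


def slope_gap():
    # Absolute difference between the two PReLU slopes (index 1 in each Sequential).
    return torch.abs(model1[1].weight - model2[1].weight).sum()


gap_before = slope_gap().item()

for step in range(100):
    optimizer.zero_grad()
    task_loss = criterion(model1(inputs), targets) + criterion(model2(inputs), targets)
    penalty = torch.nn.functional.relu(2 - slope_gap())
    total_loss = task_loss + penalty
    total_loss.backward()
    optimizer.step()

gap_after = slope_gap().item()
print(f"gap before: {gap_before:.3f}, gap after: {gap_after:.3f}")
```

Whether the gap settles at or slightly below 2 depends on how strongly the task loss pulls on the slopes, so treat the printed numbers as a sanity check rather than a guarantee.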
This line is what you are after in this case:
total_loss += torch.nn.functional.relu(
    2 - torch.abs(module1.get_weight() - module2.get_weight()).sum()
)
relu is used so the network can't reap unlimited benefit just by driving the weights apart. Without it, the loss would keep decreasing (eventually becoming negative) as the difference between the weights grew. With it, a bigger difference is rewarded only until the gap reaches 2; beyond that the term is zero and makes no difference.
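A quick check of how this term behaves, using a hypothetical helper gap_penalty and made-up scalar values: below the margin, the penalty is positive and its gradient pushes the two parameters apart; above the margin, both the penalty and its gradient are zero.

```python
import torch


def gap_penalty(a, b, margin=2.0):
    # Hinge-style term: margin - |a - b| when the gap is too small, else 0.
    return torch.nn.functional.relu(margin - torch.abs(a - b).sum())


# Gap of 0.5, i.e. below the margin of 2.
a = torch.tensor([1.0], requires_grad=True)
b = torch.tensor([0.5], requires_grad=True)
small_gap = gap_penalty(a, b)
small_gap.backward()
# Penalty is 1.5; gradient descent raises a (grad -1) and lowers b (grad +1).
print(small_gap.item(), a.grad.item(), b.grad.item())

# Gap of 3, i.e. above the margin: no penalty and no gradient.
c = torch.tensor([3.0], requires_grad=True)
d = torch.tensor([0.0], requires_grad=True)
large_gap = gap_penalty(c, d)
large_gap.backward()
print(large_gap.item(), c.grad.item(), d.grad.item())
```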
You may have to increase 2 to 2.1 or so if the gap has to strictly exceed the threshold of 2, since the penalty shrinks to zero as the gap approaches 2.0 and so offers little incentive to push past it.
Edit
Without an explicitly given threshold it might be hard, but maybe something like this would work:
weights1 = module1.get_weight()
weights2 = module2.get_weight()
total_loss = (
    (torch.abs(weights1) + torch.abs(weights2)).sum()
    + (1 / torch.abs(weights1) + 1 / torch.abs(weights2)).sum()
    - torch.abs(weights1 - weights2).sum()
)
It's a bit of a hack, but might be worth a try (if you apply additional L2 regularization).
In essence, this loss has its optimum at (-inf, +inf) pairs of weights in corresponding positions and will never be smaller than zero.
For these weights:
weights_a = torch.tensor([-1000.0, 1000, -1000, 1000, -1000])
weights_b = torch.tensor([1000.0, -1000, 1000, -1000, 1000])
the loss for each part will be:
(torch.abs(weights_a) + torch.abs(weights_b)).sum()  # 10000
(1 / torch.abs(weights_a) + 1 / torch.abs(weights_b)).sum()  # 0.0100
torch.abs(weights_a - weights_b).sum()  # 10000
In this case the network can reap easy benefits just by making the weights larger with opposite signs in the two modules, disregarding what you actually want to optimize, and I suspect it might be highly unstable. A large L2 penalty on the weights of both modules might help; I think the optimum would then be around 1/-1 when the L2 coefficient alpha is equal to 1.
With this loss function, if the network gets the sign of a large weight wrong, it will be heavily penalized.
In this case you would be left with the L2 alpha parameter to tune, which is less strict than a fixed margin but still requires a hyperparameter choice.
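To make the behaviour of this margin-free loss concrete, here is a small sketch that evaluates it on a few hand-picked weight pairs. The function name margin_free_loss and the test values are illustrative assumptions; float64 is used only so that the tiny reciprocal term is not lost to rounding.

```python
import torch


def margin_free_loss(w1, w2):
    # Sum of magnitudes, plus reciprocal magnitudes, minus the absolute difference.
    return (
        (torch.abs(w1) + torch.abs(w2)).sum()
        + (1 / torch.abs(w1) + 1 / torch.abs(w2)).sum()
        - torch.abs(w1 - w2).sum()
    )


f64 = torch.float64

# Large weights with opposite signs: the first and last terms cancel,
# leaving only the small reciprocal term.
opposite = margin_free_loss(
    torch.tensor([-1000.0, 1000, -1000, 1000, -1000], dtype=f64),
    torch.tensor([1000.0, -1000, 1000, -1000, 1000], dtype=f64),
)

# Large weights with the same sign: nothing cancels, so the loss is huge.
same_sign = margin_free_loss(
    torch.tensor([1000.0], dtype=f64), torch.tensor([1000.0], dtype=f64)
)

# Small weights with opposite signs: the reciprocal term dominates.
small = margin_free_loss(
    torch.tensor([0.5], dtype=f64), torch.tensor([-0.5], dtype=f64)
)

print(opposite.item(), same_sign.item(), small.item())
```

The three values come out at roughly 0.01, 2000.002 and 4.0, which matches the description: opposite-sign large weights are cheap, a wrong sign on a large weight is expensive, and weights near zero are discouraged by the reciprocal term.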

- Thanks for your answer. It is what I was after. I wonder if it is possible to leave the margin (i.e., the 2) unspecified, so that the network trains to increase the margin between the weights as much as it can. – Kevin May 21 '20 at 14:47
- @Kevin Tried to do something like you want (remember about `L2`), but you are still left with a hyperparameter. The margin isn't a problem per se (without `L2` it would go towards the largest possible margin); the problem is that the network can create pairs of `+inf, -inf` weights and drive this whole loss towards zero (yet not learn anything). If you want to optimize something, you need a point where the minimum exists; in this case it might be considered "softer" due to `L2`, but it's not a giant change. In your case you want neither `0.0` weights nor `+inf/-inf`, so there's always a tradeoff. – Szymon Maszke May 21 '20 at 16:44
- @Kevin Also notice I've added `.sum()` so that a single value is added to the loss in both possible solutions. – Szymon Maszke May 21 '20 at 16:46
- Thanks for your clarification. I wonder: if in your training loop I use a for-loop to update one model at a time, instead of updating both models at once, would the two types of training be equivalent? – Kevin May 22 '20 at 23:17
- You could do either, but it wouldn't be equivalent, as only one set of weights would be updated at a time. – Szymon Maszke May 23 '20 at 08:33
- Thanks for your answer. Should module1.get_weight() detach the weights before returning them? I noticed that without detaching the network isn't learning. What exactly is the effect? – Kevin Jun 16 '20 at 03:04