
Good morning everyone

Below is my implementation of a PyTorch Siamese network. I am using a batch size of 32, MSE loss, and SGD with 0.9 momentum as the optimizer.

class SiameseCNN(nn.Module):
    def __init__(self):
        super(SiameseCNN, self).__init__()                                      # 1, 40, 50
        self.convnet = nn.Sequential(nn.Conv2d(1, 8, 7), nn.ReLU(),             # 8, 34, 44
                                    nn.Conv2d(8, 16, 5), nn.ReLU(),             # 16, 30, 40
                                    nn.MaxPool2d(2, 2),                         # 16, 15, 20
                                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), # 32, 15, 20
                                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU()) # 64, 15, 20
        self.linear1 = nn.Sequential(nn.Linear(64 * 15 * 20, 100), nn.ReLU())
        self.linear2 = nn.Sequential(nn.Linear(100, 2), nn.ReLU())
        
    def forward(self, data):
        res = []
        for j in range(2):
            x = self.convnet(data[:, j, :, :])
            x = x.view(-1, 64 * 15 * 20)
            res.append(self.linear1(x))
        fres = abs(res[1] - res[0])
        return self.linear2(fres)

Each batch contains alternating pairs, i.e. [pos, pos], [pos, neg], [pos, pos], etc. However, the network doesn't converge, and the problem seems to be that fres in the network is the same for each pair (regardless of whether it is a positive or negative pair), and the output of self.linear2(fres) is always approximately equal to [0.0531, 0.0770]. This is in contrast with what I am expecting, which is that the first value of [0.0531, 0.0770] would get closer to 1 for a positive pair as the network learns, and the second value would get closer to 1 for a negative pair. These two values also need to sum up to 1.

I have tested exactly the same setup and the same input images with a 2-channel network architecture, where, instead of feeding in [pos, pos], you stack those 2 images depth-wise, for example numpy.stack([pos, pos], -1). The first convolution also changes from nn.Conv2d(1, 8, 7) to nn.Conv2d(2, 8, 7) in this setup. This works perfectly fine.
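
For reference, here is a rough sketch of what that 2-channel variant looks like (the class name, import lines and channels-first input shape are illustrative assumptions on my part, not my exact code):

import torch
import torch.nn as nn

class TwoChannelCNN(nn.Module):
    """Illustrative 2-channel variant: the image pair is stacked as 2 input channels."""
    def __init__(self):
        super(TwoChannelCNN, self).__init__()
        self.convnet = nn.Sequential(nn.Conv2d(2, 8, 7), nn.ReLU(),             # only change: 2 input channels
                                    nn.Conv2d(8, 16, 5), nn.ReLU(),
                                    nn.MaxPool2d(2, 2),
                                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.linear1 = nn.Sequential(nn.Linear(64 * 15 * 20, 100), nn.ReLU())
        self.linear2 = nn.Sequential(nn.Linear(100, 2), nn.ReLU())

    def forward(self, x):                        # x: (batch, 2, 40, 50), the stacked pair
        x = self.convnet(x)
        x = x.view(-1, 64 * 15 * 20)
        return self.linear2(self.linear1(x))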

I have also tested exactly the same setup and input images with a traditional CNN approach, where I just pass single positive and negative greyscale images into the network, instead of stacking them (as with the 2-channel approach) or passing them in as image pairs (as with the Siamese approach). This also works perfectly, but the results are not as good as with the 2-channel approach.

EDIT (Solutions I've tried):

def forward(self, data):
    res = []
    for j in range(2):
        x = self.convnet(data[:, j, :, :])
        x = x.view(-1, 64 * 15 * 20)
        res.append(x)
    fres = self.linear2(self.linear1(abs(res[1] - res[0])))
    return fres

def forward(self, data):
    res = []
    for j in range(2):
        x = self.convnet(data[:, j, :, :])
        res.append(x)
    pdist = nn.PairwiseDistance(p=2)
    diff = pdist(res[1], res[0])
    diff = diff.view(-1, 64 * 15 * 10)
    fres = self.linear2(self.linear1(diff))
    return fres

Another thing to note, perhaps, is that within the context of my research, a Siamese network is trained for each object. So the first class is associated with images containing the object in question, and the second class is associated with images containing other objects. I don't know if this might be the cause of the problem. It is, however, not a problem within the context of the traditional CNN and 2-channel CNN approaches.

As per request, here is my training code:

model = SiameseCNN().cuda()
ls_fn = torch.nn.BCELoss()
optim = torch.optim.SGD(model.parameters(),  lr=1e-6, momentum=0.9)
epochs = np.arange(100)
eloss = []
for epoch in epochs:
    model.train()
    train_loss = []
    for x_batch, y_batch in dp.train_set:
        x_var, y_var = Variable(x_batch.cuda()), Variable(y_batch.cuda())
        y_pred = model(x_var)
        loss = ls_fn(y_pred, y_var)
        train_loss.append(abs(loss.item()))
        optim.zero_grad()
        loss.backward()
        optim.step()
    eloss.append(np.mean(train_loss))
    print(epoch, np.mean(train_loss))

Note that dp in dp.train_set is an object with the attributes train_set, valid_set, and test_set, where each set is created as follows:

DataLoader(TensorDataset(torch.Tensor(x), torch.Tensor(y)), batch_size=bs)
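
To be explicit about the shapes (these are only illustrative; the exact label layout depends on the loss, but this is roughly how x and y are arranged so that data[:, j, :, :] in forward yields a (batch, 1, 40, 50) input for the convnet):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative shapes only: each sample is a pair of single-channel 40x50 images.
n_pairs, bs = 1000, 32
x = np.random.rand(n_pairs, 2, 1, 40, 50).astype(np.float32)                 # (N, pair, channel, H, W)
y = np.tile([[1.0, 0.0], [0.0, 1.0]], (n_pairs // 2, 1)).astype(np.float32)  # alternating pos/neg targets
train_set = DataLoader(TensorDataset(torch.Tensor(x), torch.Tensor(y)), batch_size=bs)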

As per request, here is an example of the predicted probabilities vs true label, where you can see the model doesn't seem to be learning:

Predicted:  0.5030623078346252 Label:  1.0
Predicted:  0.5030624270439148 Label:  0.0
Predicted:  0.5030624270439148 Label:  1.0
Predicted:  0.5030625462532043 Label:  0.0
Predicted:  0.5030625462532043 Label:  1.0
Predicted:  0.5030626654624939 Label:  0.0
Predicted:  0.5030626058578491 Label:  1.0
Predicted:  0.5030627250671387 Label:  0.0
Predicted:  0.5030626654624939 Label:  1.0
Predicted:  0.5030627846717834 Label:  0.0
Predicted:  0.5030627250671387 Label:  1.0
Predicted:  0.5030627846717834 Label:  0.0
Predicted:  0.5030627250671387 Label:  1.0
Predicted:  0.5030628442764282 Label:  0.0
Predicted:  0.5030627846717834 Label:  1.0
Predicted:  0.5030628442764282 Label:  0.0
Emile Beukes
  • Using cosine similarity or a correlation coefficient for comparing the network bodies might produce more stable results than `abs(res[1] - res[0])`. I've actually experienced this exact same problem on one of my own projects, but I haven't yet gotten around to fixing it – bug_spray May 14 '20 at 10:13
  • Thanks @bug_spray. Tried it but still results in the same problem. Was a good idea though. – Emile Beukes May 14 '20 at 11:17
  • I don't think you can expect the two outputs to sum to one unless a.) you use a loss function to encourage this, or b.) you use a softmax layer at the end. – DerekG May 14 '20 at 13:42
  • @DerekG Yes I believe you are right. I was under the impression that the `ReLU` is responsible for this but it only caps out negative values. My mistake. Thanks! – Emile Beukes May 14 '20 at 14:15
  • Yeah, ReLU just takes the max of the value and 0. Important to note is that `softmax` has poor gradient properties, so if you're going to backpropagate through the layer, use `log_softmax()` instead. The best solution is probably to leave off the `softmax` layer altogether during training and simply use it for evaluation. – DerekG May 14 '20 at 16:06
  • @DerekG thanks that is helpful advice! I am not using softmax anywhere, just ReLU. From my understanding ReLU is one of the preferred activation functions when it comes to preserving the gradients. – Emile Beukes May 15 '20 at 10:05
  • The training loop looks correct. I recommend not using the `Variable` constructor when loading the batch to `cuda`, as this API is deprecated and there is no need to convert input tensors to variables. Plot a batch to see whether the y values are correctly set. – Guillem Jun 28 '20 at 08:29
  • Please see output above of predicted vs label – Emile Beukes Jun 28 '20 at 08:43

2 Answers


I think that your approach is correct and you are doing things fine. What looks a bit weird to me is the last layer, which has a ReLU activation. Usually with Siamese networks you want to output a high probability when the two input images belong to the same class and a low probability otherwise, so you can implement this with a single output neuron and a sigmoid activation function.

Therefore I would reimplement your network as follows:

class SiameseCNN(nn.Module):
    def __init__(self):
        super(SiameseCNN, self).__init__()                                      # 1, 40, 50
        self.convnet = nn.Sequential(nn.Conv2d(1, 8, 7), nn.ReLU(),             # 8, 34, 44
                                    nn.Conv2d(8, 16, 5), nn.ReLU(),             # 16, 30, 40
                                    nn.MaxPool2d(2, 2),                         # 16, 15, 20
                                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), # 32, 15, 20
                                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU()) # 64, 15, 20
        self.linear1 = nn.Sequential(nn.Linear(64 * 15 * 20, 100), nn.ReLU())
        self.linear2 = nn.Sequential(nn.Linear(100, 1), nn.Sigmoid())
        
    def forward(self, data):
        res = []
        for j in range(2):
            x = self.convnet(data[:, j, :, :])
            x = x.view(-1, 64 * 15 * 20)
            res.append(self.linear1(x))
        fres = res[0].sub(res[1]).pow(2)
        return self.linear2(fres)

Then, to be consistent with this during training, you should use binary cross-entropy:

criterion_fn = torch.nn.BCELoss()

And remember to set the label to 1 when both input images belong to the same class.
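
Something like this (a minimal sketch; the model and x_batch names just reuse those from your training loop, and the label values are illustrative):

import torch

criterion_fn = torch.nn.BCELoss()

# With a single sigmoid output, the target is one value per pair, shaped (batch, 1)
# to match the model output: 1.0 when both images belong to the same class, else 0.0.
y_batch = torch.tensor([[1.0], [0.0], [1.0], [0.0]])
y_pred = model(x_batch)                 # (batch, 1), values in (0, 1)
loss = criterion_fn(y_pred, y_batch)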

Also, I recommend using a little bit of dropout, with around a 30% probability of dropping a neuron, after the linear1 layer.
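
For example, the change inside __init__ would just be:

self.linear1 = nn.Sequential(nn.Linear(64 * 15 * 20, 100), nn.ReLU(), nn.Dropout(p=0.3))  # ~30% dropout after linear1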

Guillem
  • Hi Guillem, thank you very much for taking the time to answer. I have tried your approach, and unfortunately it still doesn't work; I get the same loss every epoch - `0.69.....`. I added the dropout like you suggested, tried the pairwise distance instead of your `res[0].sub(res[1]).pow(2)`, increased the learning rate, set the labels correctly and used `torch.nn.BCELoss()`, and tried a batch_size of 1 instead of 32. Nothing works... – Emile Beukes Jun 28 '20 at 07:45

Problem solved. It turns out the network will predict the same output every time if you give it the same images every time. There was a small indexing mistake on my part during data partitioning. Thanks for everyone's help and assistance. Here is an example of the convergence as it is now:

0 0.20198837077617646
1 0.17636818194389342
2 0.15786472541093827
3 0.1412761415243149
4 0.126698794901371
5 0.11397973036766053
6 0.10332610329985618
7 0.09474560652673245
8 0.08779258838295936
9 0.08199785630404949
10 0.07704121413826942
11 0.07276330365240574
12 0.06907484836131335
13 0.06584368328005076
14 0.06295975042134523
15 0.06039590438082814
16 0.058096024941653016
Emile Beukes