
I am trying to train a Siamese neural network using Keras, with the goal of identifying whether two images belong to the same class or not. My data is shuffled and has an equal number of positive and negative examples. My model is not learning anything: it always predicts the same output, and the loss, validation accuracy, and validation loss stay the same every epoch.

[Training output screenshot: loss, accuracy, validation loss, and validation accuracy are identical on every epoch]

import numpy as np
import pandas as pd
# imread could come from skimage.io, imageio, or matplotlib; skimage is assumed here
from skimage.io import imread
from keras import backend as K
from keras.models import Model, Sequential
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Lambda

def convert(row):
    # Load the image at the file path stored in the CSV cell
    return imread(row)

def contrastive_loss(y_true, y_pred):
    # Contrastive loss (Hadsell et al. 2006): penalizes large y_pred for
    # similar pairs (y_true = 1) and y_pred below `margin` for dissimilar pairs
    margin = 1
    square_pred = K.square(y_pred)
    margin_square = K.square(K.maximum(margin - y_pred, 0))
    return K.mean(y_true * square_pred + (1 - y_true) * margin_square)

def SiameseNetwork(input_shape):
    top_input = Input(input_shape)

    bottom_input = Input(input_shape)

    # Shared convolutional encoder (input_shape is only needed on the first layer)
    model = Sequential()
    model.add(Conv2D(96, (7, 7), activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D())
    model.add(Conv2D(256, (5, 5), activation='relu'))
    model.add(MaxPooling2D())
    model.add(Conv2D(256, (5, 5), activation='relu'))
    model.add(MaxPooling2D())
    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1024, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.5))

    encoded_top = model(top_input)
    encoded_bottom = model(bottom_input)

    # Element-wise L1 distance between the two embeddings
    L1_layer = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]))
    L1_distance = L1_layer([encoded_top, encoded_bottom])

    prediction = Dense(1, activation='sigmoid')(L1_distance)
    siamesenet = Model(inputs=[top_input, bottom_input], outputs=prediction)
    return siamesenet

data = pd.read_csv('shuffleddata.csv')

# Keras expects NumPy arrays, not Python lists
print('Converting X1....')
X1 = np.array([convert(x) for x in data['X1']])

print('Converting X2....')
X2 = np.array([convert(x) for x in data['X2']])

print('Converting Y.....')
Y = np.array([0 if label == 'Negative' else 1 for label in data['Y']])

input_shape = (53, 121, 3)
model = SiameseNetwork(input_shape)
model.compile(loss=contrastive_loss, optimizer='sgd', metrics=['accuracy'])
print(model.summary())
# The network has two inputs, so both image arrays must be passed to fit()
model.fit([X1, X2], Y, batch_size=32, epochs=20, shuffle=True, validation_split=0.2)
model.save('Siamese.h5')

S.Hemanth
  • Have you tried with a smaller step size (or even a different optimizer)? Have you tried overfitting on a small part of your dataset? – Zaccharie Ramzi Dec 22 '19 at 10:17
  • Yes, I have tried a smaller step size and different optimizers and loss functions. I have also tried overfitting on a small dataset, but the model is not learning anything. Could you please check whether the way I'm giving the input is right? – S.Hemanth Dec 22 '19 at 11:07
  • Hmm well, the layer you call `L1_distance` I am guessing is supposed to be the distance between the 2 outputs of the siamese network, but here it's an error map. You need to compute the mean l1 difference, something like: `Lambda(lambda tensors: K.mean(K.abs(tensors[0] - tensors[1])))`. I am also surprised that you have a dense layer after this layer. Shouldn't the output just be the distance? – Zaccharie Ramzi Dec 22 '19 at 11:21
  • I have tried your version of the L1 distance and also removed the dense layer, but it didn't work. By the way, I need a probability of how similar the given two images are, not a distance between the images. – S.Hemanth Dec 23 '19 at 04:55
  • Sure, makes sense. I was just thinking this might be simpler to start with: you could remove the Dense layer and have a sigmoid in your contrastive loss. But I have seen the implementation you used and I understand you want to stick to it. – Zaccharie Ramzi Dec 24 '19 at 08:14
  • Regarding overfitting, have you tried using a single image, forming a pair and then overfitting that? The network should always output a single value then – Zaccharie Ramzi Dec 24 '19 at 08:21
  • This issue has been resolved to an extent. I found out that I was using too little data to train this model. The model works fine on other standard datasets, and even for those the issue comes back if I use too little data. So I think I should use more data. – S.Hemanth Dec 25 '19 at 09:00

3 Answers


Mentioning the resolution to this issue in this section (even though it is present in the comments), for the benefit of the community.

Since the model works fine with other standard datasets, the solution is to use more data. The model is not learning because it has too little data for training.

  • Thanks for the answer, but I think Siamese networks are meant to work with small amounts of data and still give good results. As far as I know they are used in few-shot learning, which needs very little data, so how can the amount of data be the issue here? I have the same issue while training on the AT&T dataset, which has 400 images of 40 people. I trained a model on this data using PyTorch and it learned and gave me good results; when I trained on the same dataset using a Keras model, it wasn't learning anything. Do you have any idea about that? – mohamed_abdullah Jan 06 '21 at 16:15

The model works fine with more data, as mentioned in the comments and in the answer by Tensorflow Support. Tweaking the model a little also helps: changing the number of filters in the 2nd and 3rd convolutional layers from 256 to 64 reduces the number of trainable parameters considerably, and the model then starts learning. A sketch of the tweaked encoder follows.
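For illustration, a minimal sketch of the tweaked shared encoder, assuming the same (53, 121, 3) input shape as in the question:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

input_shape = (53, 121, 3)

# Same convolutional stack as in the question, but the 2nd and 3rd
# conv layers use 64 filters instead of 256, which cuts the number
# of trainable parameters substantially
model = Sequential()
model.add(Conv2D(96, (7, 7), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D())
model.add(Conv2D(64, (5, 5), activation='relu'))
model.add(MaxPooling2D())
model.add(Conv2D(64, (5, 5), activation='relu'))
model.add(MaxPooling2D())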

S.Hemanth

I want to mention a few things here which may be useful to others:

1) Data stratification / random sampling

When you use validation_split, Keras uses the last x percent of the data as validation data. This means that if the data is ordered by class, e.g. because pairs or triplets were built in sequence, the validation data will only come from the classes (or the class) contained in the last x percent of the data, and the validation set will be of no use. It is therefore essential to shuffle the input data so that the validation set contains random samples from each class.

The docs for validation_split say:

Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling
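
A minimal sketch of shuffling pairs and labels together before fitting, assuming X1, X2, and Y are NumPy arrays as in the question:

import numpy as np

# Apply the same random permutation to all three arrays so that
# each image pair stays aligned with its label
perm = np.random.permutation(len(Y))
X1, X2, Y = X1[perm], X2[perm], Y[perm]

# The last 20% that validation_split sets aside is now a random sample
model.fit([X1, X2], Y, validation_split=0.2, batch_size=32, epochs=20)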

2) Choice of optimizer

In model.compile() choosing optimizer='sgd' may not be the best approach since sgd can get stuck in local minima etc. Adam (see docs) seems to be a good choice to start with since it...

[...] combines the advantages of [...] AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives.

according to Kingma and Ba (2014, page 10).

from keras.optimizers import Adam
...
model.compile(loss=contrastive_loss, optimizer=Adam(lr=0.0001))

3) Early stopping / learning rate

Using early stopping and adjusting the learning rate during training may also be highly useful for achieving good results: the model can then train until it stops improving and stop automatically at that point.

from keras.callbacks import EarlyStopping
from keras.callbacks import ReduceLROnPlateau
...
early_stopping = EarlyStopping(monitor='val_loss', patience=50, mode='auto', restore_best_weights=True)
reduce_on_plateau = ReduceLROnPlateau(monitor="val_loss", factor=0.8, patience=15, cooldown=5, verbose=0)
...
hist = model.fit([img_1, img_2], y, 
            validation_split=.2, 
            batch_size=128, 
            verbose=1, 
            epochs=9999,
            callbacks=[early_stopping, reduce_on_plateau])  # pass both callbacks

4) Kernel initialization

Kernel initialization (with a small standard deviation) may be helpful as well.

seq = Sequential()

# Layer 1
seq.add(Conv2D(8, (5, 5), input_shape=input_shape,
    kernel_initializer=keras.initializers.TruncatedNormal(mean=0.0, stddev=0.01, seed=None),
    data_format="channels_first"))
seq.add(Activation('relu'))
seq.add(MaxPooling2D(pool_size=(2, 2)))
seq.add(Dropout(0.1))

5) Overfitting

I noticed that instead of using dropout to fight overfitting, adding some noise can be rather helpful. In this case, simply add a GaussianNoise layer at the top of the network, as sketched below.
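
A minimal sketch, with the stddev value of 0.1 as an arbitrary starting point and the input shape taken from the question:

from keras.models import Sequential
from keras.layers import GaussianNoise, Conv2D, Activation

input_shape = (53, 121, 3)  # as in the question

seq = Sequential()
# Inject Gaussian noise right after the input; GaussianNoise is a
# regularization layer and is only active at training time
seq.add(GaussianNoise(0.1, input_shape=input_shape))
seq.add(Conv2D(8, (5, 5)))
seq.add(Activation('relu'))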

Peter