Custom combined hinge/KL-divergence loss function in siamese-net fails to generate meaningful speaker-embeddings

Question

I'm currently trying to implement a siamese-net in Keras where I have to implement the following loss function:

loss(p ∥ q) = Is · KL(p ∥ q) + Ids · HL(p ∥ q)

[Image: detailed description of the loss function from the paper]

Where KL is the Kullback-Leibler divergence and HL is the Hinge-loss.

During training, I label same-speaker pairs as 1, different speakers as 0.

The goal is to use the trained net to extract embeddings from spectrograms. A spectrogram is a 2-dimensional numpy-array 40x128 (time x frequency)

The problem is that I never get above 0.5 accuracy, and when I cluster the speaker embeddings there seems to be no correlation between embeddings and speakers.

I implemented the KL divergence as the distance measure, and adjusted the hinge loss accordingly:

def kullback_leibler_divergence(vects):
    x, y = vects
    x = ks.backend.clip(x, ks.backend.epsilon(), 1)
    y = ks.backend.clip(y, ks.backend.epsilon(), 1)
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)


def kullback_leibler_shape(shapes):
    shape1, shape2 = shapes
    return shape1[0], 1


def kb_hinge_loss(y_true, y_pred):
    """
    y_true: binary label, 1 = same speaker
    y_pred: output of the siamese net, i.e. the Kullback-Leibler divergence
    """
    MARGIN = 1.
    hinge = ks.backend.mean(ks.backend.maximum(MARGIN - y_pred, 0.), axis=-1)
    return y_true * y_pred + (1 - y_true) * hinge

A single spectrogram would be fed into a branch of the base network, the siamese-net consists of two such branches, so two spectrograms are fed simultaneously, and joined in the distance-layer. The output of the base network is 1 x 128. The distance layer computes the kullback-leibler divergence and its output is fed into the kb_hinge_loss. The architecture of the base-network is as follows:

def create_lstm(units: int, gpu: bool, name: str, is_sequence: bool = True):
    if gpu:
        return ks.layers.CuDNNLSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
    else:
        return ks.layers.LSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)


def build_model(mode: str = 'train') -> ks.Model:
    topology = TRAIN_CONF['topology']

    is_gpu = tf.test.is_gpu_available(cuda_only=True)

    model = ks.Sequential(name='base_network')

    model.add(
        ks.layers.Bidirectional(create_lstm(topology['blstm1_units'], is_gpu, name='blstm_1'), input_shape=INPUT_DIMS))

    model.add(ks.layers.Dropout(topology['dropout1']))

    model.add(ks.layers.Bidirectional(create_lstm(topology['blstm2_units'], is_gpu, is_sequence=False, name='blstm_2')))

    if mode == 'extraction':
        return model

    num_units = topology['dense1_units']
    model.add(ks.layers.Dense(num_units, name='dense_1'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))

    model.add(ks.layers.Dropout(topology['dropout2']))

    num_units = topology['dense2_units']
    model.add(ks.layers.Dense(num_units, name='dense_2'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))

    num_units = topology['dense3_units']
    model.add(ks.layers.Dense(num_units, name='dense_3'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))

    num_units = topology['dense4_units']
    model.add(ks.layers.Dense(num_units, name='dense_4'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    return model

I then build a siamese net as follows:

    base_network = build_model()

    input_a = ks.Input(shape=INPUT_DIMS, name='input_a')
    input_b = ks.Input(shape=INPUT_DIMS, name='input_b')

    processed_a = base_network(input_a)
    processed_b = base_network(input_b)

    distance = ks.layers.Lambda(kullback_leibler_divergence,
                                output_shape=kullback_leibler_shape,
                                name='distance')([processed_a, processed_b])

    model = ks.Model(inputs=[input_a, input_b], outputs=distance)
    adam = build_optimizer()
    model.compile(loss=kb_hinge_loss, optimizer=adam, metrics=['accuracy'])

Lastly, I build a net with the same architecture but only one input, extract embeddings with it, and take the mean over them. The resulting embedding should serve as a representation of a speaker, to be used during clustering:

utterance_embedding = np.mean(embedding_extractor.predict_on_batch(spectrogram), axis=0)

We train the net on the voxceleb speaker set.

The full code can be seen here: GitHub repo

I'm trying to figure out if I have made any wrong assumptions and how to improve my accuracy.

There is a lot of information missing in your question... these would help a lot: 1 - The output shape of base_network // 2 - The meaning of the outputs of base_network // 3 - The activation function of base_network's output // 4 - What is the embedding extractor? // 5 - What is the spectrogram, is it the input of the base_network? // 6 - Are P and Q supposed to be the same thing in KL and in HL? This seems a little incompatible - Daniel Möller, Dec 06 '18 at 13:34

Daniel Möller · Accepted Answer · 2018-12-13T19:18:39.680

Issue with accuracy

Notice that in your model:

y_true = labels
y_pred = kullback-leibler divergence

These two cannot be compared directly. For example:

For a correct result, when y_true == 1 (same speaker), the Kullback-Leibler divergence is y_pred == 0 (no divergence).

So it's entirely expected that the accuracy metric will not work properly.

You can then either create a custom metric or rely only on the loss for evaluation.
A custom metric needs a few adjustments to the model in order to be feasible, as explained below.
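To make the mismatch concrete, here is a minimal NumPy sketch. It assumes the built-in metric behaves like a binary accuracy that thresholds y_pred at 0.5 (how Keras resolves 'accuracy' depends on the version and output shape, so treat that as an assumption), and the divergence values are made up for illustration:

```python
import numpy as np

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of samples where (y_pred > threshold) equals the label."""
    return float(np.mean((y_pred > threshold).astype(float) == y_true))

# Hypothetical outputs of a *perfectly* trained model:
# same-speaker pairs (label 1) have a divergence near 0,
# different-speaker pairs (label 0) have a large divergence.
y_true = np.array([1.0, 1.0, 0.0, 0.0])
y_pred = np.array([0.01, 0.05, 2.3, 4.1])

# Reading the divergence as if it were a probability of "same speaker"
acc_naive = binary_accuracy(y_true, y_pred)

# Reading a *small* divergence as "same speaker" instead
acc_flipped = float(np.mean((y_pred < 0.5).astype(float) == y_true))

print(acc_naive, acc_flipped)
```

Under these assumptions, the better the model gets, the worse the naive accuracy looks, which is why the reported number says little about training progress.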

Possible issues with the loss

Clipping

This might be a problem

First, notice that you're clipping the values fed to the Kullback-Leibler divergence. This can be harmful because clipping zeroes the gradients in the clipped regions. And since your activation is a PReLU, you get values below zero and above 1, so there will certainly be zero-gradient cases here and there, with the risk of a frozen model.

So, you might not want to clip these values. To avoid negative values from the PReLU, you can try a 'softplus' activation, which is a kind of soft ReLU without negative values. You might also add an epsilon to avoid trouble, but there is no problem in leaving values bigger than one:

# assuming you used 'softplus' instead of PReLU for the outputs passed in as `speakers`
def kullback_leibler_divergence(speakers):
    x, y = speakers
    x = x + ks.backend.epsilon()
    y = y + ks.backend.epsilon()
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
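The gradient argument above can be checked numerically. This is only a sketch in plain NumPy with finite differences rather than Keras; the helper names (numeric_grad, clip01, softplus) and the sample points are illustrative assumptions:

```python
import numpy as np

def numeric_grad(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx, element by element."""
    return (f(x + h) - f(x - h)) / (2 * h)

def clip01(x, eps=1e-7):
    # Same idea as clip(x, epsilon, 1) in the original distance function
    return np.clip(x, eps, 1.0)

def softplus(x):
    # log(1 + exp(x)), computed in a numerically stable way
    return np.logaddexp(0.0, x)

# One point below, one inside, and one above the clip range
x = np.array([-2.0, 0.5, 3.0])

grad_clip = numeric_grad(clip01, x)
grad_softplus = numeric_grad(softplus, x)

print(grad_clip)      # zero outside the clip range
print(grad_softplus)  # strictly positive everywhere
```

Outside the range the clipped values carry no gradient at all, while softplus always passes some signal back.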

Asymmetry in Kullback-Leibler

This IS a problem

Notice also that Kullback-Leibler is not a symmetric function and, when the inputs are not normalized distributions (as is the case here), it doesn't have its minimum at zero! A perfect match gives zero, but bad matches can give lower (negative) values, and this is bad for a loss function because it will drive you toward divergence.

[Image: graph of the KL divergence]

Your paper states that you should sum two losses: (p||q) and (q||p).
This eliminates the asymmetry and also the negative values.

So:

distance1 = ks.layers.Lambda(kullback_leibler_divergence,
                            name='distance1')([processed_a, processed_b])
distance2 = ks.layers.Lambda(kullback_leibler_divergence,
                            name='distance2')([processed_b, processed_a])
distance = ks.layers.Add(name='dist_add')([distance1,distance2])
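The asymmetry and the negative values can be verified with a small NumPy sketch. The vectors below are arbitrary positive, unnormalized examples (an assumption for illustration, not values from the actual network). The symmetric sum equals sum((x - y) * log(x / y)), which is never negative for positive inputs because both factors always have the same sign:

```python
import numpy as np

def kl_sum(x, y):
    """Unnormalized KL-style expression: sum(x * log(x / y))."""
    return float(np.sum(x * np.log(x / y)))

# Arbitrary positive vectors that do NOT sum to 1
x = np.array([0.5, 0.5])
y = np.array([1.0, 1.0])

forward = kl_sum(x, y)          # negative, although x and y differ
backward = kl_sum(y, x)         # positive, and not equal to -forward
symmetric = forward + backward  # the sum of both directions

# Closed form of the symmetric sum
closed_form = float(np.sum((x - y) * np.log(x / y)))

print(forward, backward, symmetric, closed_form)
```

With both directions summed, the distance is zero only for identical vectors and positive otherwise, which is what a distance fed into the hinge term needs.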

Very low margin and clipped hinge

This might be a problem

Finally, notice that the hinge loss also clips values below zero!
Since Kullback-Leibler is not limited to 1, samples with high divergence may not be controlled by this loss. I'm not sure whether this is really an issue, but you might want to do one of the following:

- increase the margin
- inside the Kullback-Leibler, use mean instead of sum
- use a softplus in the hinge instead of a max, to avoid losing gradients.

See:

MARGIN = someValue  # placeholder: pick a margin that suits the scale of your divergences
hinge = ks.backend.mean(ks.backend.softplus(MARGIN - y_pred), axis=-1)
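As a rough illustration of the difference between the two hinge variants, the following NumPy sketch compares their values and finite-difference gradients. MARGIN = 1.0 and the sample divergences are arbitrary choices for the demonstration, not recommended settings:

```python
import numpy as np

MARGIN = 1.0  # illustrative value only

def hard_hinge(d):
    """max(MARGIN - d, 0): exactly zero once d exceeds the margin."""
    return np.maximum(MARGIN - d, 0.0)

def soft_hinge(d):
    """softplus(MARGIN - d): a smooth version that never becomes flat."""
    return np.logaddexp(0.0, MARGIN - d)

def numeric_grad(f, d, h=1e-6):
    """Central finite-difference estimate of df/dd."""
    return (f(d + h) - f(d - h)) / (2 * h)

# Divergences below the margin, just above it, and far above it
d = np.array([0.2, 1.5, 5.0])

hard_grad = numeric_grad(hard_hinge, d)
soft_grad = numeric_grad(soft_hinge, d)

print(hard_hinge(d), hard_grad)
print(soft_hinge(d), soft_grad)
```

The hard hinge stops pushing different-speaker pairs apart as soon as they pass the margin, while the softplus variant keeps a small, decaying gradient. Whether that helps in practice is something to test on your data.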

Now we can think of a custom accuracy

This is not very easy, since we don't have a clear limit on KL that tells us "correct/not correct".

You might start with an arbitrary threshold, but you'd need to tune it until it represents reality well. You may, for instance, use your validation data to find the threshold that gives the best accuracy.

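def customMetric(y_true_targets, y_pred_KBL):
    # `threshold` must be defined beforehand (e.g. tuned on validation data):
    # a divergence below it counts as "same speaker"
    isMatch = ks.backend.less(y_pred_KBL, threshold)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())

    # compare the predicted match flag with the true label
    isMatch = ks.backend.equal(y_true_targets, isMatch)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())

    return ks.backend.mean(isMatch)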
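One possible way to pick the threshold from validation data is a simple sweep over candidate values. This is a sketch in NumPy, run outside the Keras graph. The function names, the candidate grid, and the validation arrays are all hypothetical placeholders; in practice the divergences would come from model.predict on your validation pairs:

```python
import numpy as np

def threshold_accuracy(y_true, divergence, threshold):
    """Accuracy when a divergence below `threshold` means 'same speaker'."""
    is_match = (divergence < threshold).astype(float)
    return float(np.mean(is_match == y_true))

def best_threshold(y_true, divergence, candidates):
    """Return the first candidate threshold with the highest accuracy."""
    accuracies = [threshold_accuracy(y_true, divergence, t) for t in candidates]
    best = int(np.argmax(accuracies))
    return float(candidates[best]), accuracies[best]

# Hypothetical validation labels (1 = same speaker) and model outputs
val_labels = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
val_divergence = np.array([0.12, 0.43, 0.95, 0.71, 2.04, 3.5])

candidates = np.linspace(0.0, 4.0, 41)
threshold, accuracy = best_threshold(val_labels, val_divergence, candidates)

print(threshold, accuracy)
```

The chosen value can then be used as the `threshold` in the custom metric. Since it was tuned on validation data, report the final accuracy on a separate test set.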