3

Sorry, I'm stuck and don't know where to find a solution. I'm using two networks to construct two embeddings, and I have a binary target indicating whether embeddingA and embeddingB "match" (1 or -1). The dataset looks like this:

embA0 embB0 1.0
embA1 embB1 -1.0
embA2 embB2 1.0
...
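For reference, each row just pairs the two inputs with a ±1 target; a rough sketch of the kind of Dataset wrapper I mean (the names are made up, not my real loading code):

from torch.utils.data import Dataset

class PairDataset(Dataset):
    """Holds the inputs for the two networks plus the +1/-1 match target."""
    def __init__(self, inputs_a, inputs_b, targets):
        self.inputs_a = inputs_a   # fed to the network that produces embeddingA
        self.inputs_b = inputs_b   # fed to the network that produces embeddingB
        self.targets = targets     # tensor of shape (N,), values 1.0 or -1.0

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.inputs_a[idx], self.inputs_b[idx], self.targets[idx]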

I want to use cosine similarity to get the classification result, but I'm not sure which loss function to choose. The two networks that generate the embeddings are trained separately. I can think of two options:

Plan 1:

Build a third network that takes embeddingA and embeddingB, passes them to nn.CosineSimilarity() to compute the final score (a similarity in [-1, 1], not a probability), and then applies a binary classification loss.

(Sorry, I don't know which loss function to choose here.)

class cos_Similarity(nn.Module):
    def __init__(self):
        super(cos_Similarity, self).__init__()
        self.cos = nn.CosineSimilarity(dim=2)
        self.embA = generator_A()  # first embedding network
        self.embB = generator_B()  # second embedding network

    def forward(self, a, b):
        output_a = self.embA(a)
        output_b = self.embB(b)
        return self.cos(output_a, output_b)

loss_func = nn.CrossEntropyLoss()  # this is the part I'm not sure about

model = cos_Similarity()
y = model(a, b)
loss = loss_func(y, target)
acc = (y > 0).long()  # predicted label per pair

Plan 2: Use the two embeddings directly as the output and train with nn.CosineEmbeddingLoss() as the loss function. When I compute the accuracy, I use nn.CosineSimilarity() to get the result (a similarity in [-1, 1], not a probability).

output_a = embA(a)
output_b = embB(b)

cos = nn.CosineSimilarity(dim=2)
loss_function = torch.nn.CosineEmbeddingLoss()

loss = loss_function(output_a, output_b, target)
acc = cos(output_a, output_b)  # raw similarity scores, thresholded later for accuracy
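(To make the accuracy part concrete, what I have in mind is roughly thresholding the similarity at 0 to get a ±1 prediction, something like this sketch:)

sim = cos(output_a, output_b)
pred = (sim > 0).float() * 2 - 1        # similarity > 0 -> +1, otherwise -1
acc = (pred == target).float().mean()   # assumes target has the same shape as sim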

I really need help. How should I choose between them, and why? Or can I only decide through experiments? Thank you very much!

############################### Addition: my training function


def train_func(train_loss_list):

    train_data = load_data('train')
    trainloader = DataLoader(train_data, batch_size=BATCH_SIZE)

    cos_smi = nn.CosineSimilarity(dim=2)
    train_loss = 0

    for step, (a, b, target) in enumerate(trainloader):

        try:
            optimizer.zero_grad()

            output_a = model_A(a)  # generate embA
            output_b = model_B(b)  # generate embB

            acc = cos_smi(output_a, output_b)  # similarity scores, used as a rough "accuracy" signal

            loss = loss_fn(output_a, output_b, target.unsqueeze(dim=1))

            train_loss += loss.item()

            loss.backward()

            optimizer.step()

            train_loss_list.append(loss.item())  # store the scalar, not the graph-holding tensor

            if step % 10 == 0:
                print('train:', step, 'step', 'loss:', loss.item(), 'acc', acc)

        except Exception as e:
            print('train:', step, 'step')
            print(repr(e))

    return train_loss_list, train_loss / len(trainloader)
island145287
  • Are you getting the code from somewhere else? If that's the case, could you also link it so that I can have a better look. In general if you're dealing with binary classification it might be easier to use `nn.BCELoss` rather than `nn.CrossEntropyLoss`. – Sean Sep 05 '20 at 06:29
  • @Seankala I wrote the code myself... because I overestimated myself :( The loss function suggestion is very helpful! When I tried to run Plan 1, the model did not converge. – island145287 Sep 06 '20 at 03:02
  • What data are you using? I could try and replicate what you're trying to do. – Sean Sep 06 '20 at 04:49
  • @Seankala There are too many forward steps in my project to generate embeddings. However, your ideas have enlightened me. Can I validate which one is better by simulating some data and carrying out experiments? Or should I use real data? – island145287 Sep 06 '20 at 07:08
  • It doesn't necessarily have to be with "real data," but I would recommend using data that is identically distributed to the data that you're using. What I would personally do is just take out a smaller chunk of the data you're using and debug the model. You just have to make sure that the model converges (in the beginning stages). Also, I'm not sure if you omitted it on purpose in the example, but you should add in `loss.backward()` and `optimizer.step()` to actually train the model. – Sean Sep 06 '20 at 07:12
  • @Seankala I edited my question; that is my train_fn code. The loss never changes, and if the batch size is less than 10 the loss changes by about 0.0x, but it is still oscillating, not converging. – island145287 Sep 06 '20 at 09:26
  • If the loss is oscillating rather than converging, try adjusting your hyperparameters (in particular your learning rate). I'm also confused with the overall training procedure and model architecture. If I'm not mistaken, what you're trying to do is train a model that will determine whether two embedding vectors are "similar" or not. If I were you, I would code one model to take care of this. I'll write it in an answer as the comments are limiting. – Sean Sep 06 '20 at 09:47

2 Answers

1

In response to the comment thread.

The objective or pipeline seems to be:

  1. Receive two embedding vectors (say, A and B).
  2. Check whether these two vectors are "similar" or not (using cosine similarity).
  3. The label is 1 if they're similar and -1 otherwise (I recommend changing this to 0 and 1 rather than -1 and 1).

Here's what I can think of. Correct me if I've misunderstood something. As a disclaimer, I'm coding this mostly off intuition without knowing the details, so it may well have errors if you try to run it. Let's still try to get a high-level understanding.

Model

import torch
import torch.nn as nn


class Model(nn.Module):
    def __init__(self, num_emb, emb_dim):  # I'm assuming the embedding matrices are the same size.
        super().__init__()
        self.embedding1 = nn.Embedding(num_embeddings=num_emb, embedding_dim=emb_dim)
        self.embedding2 = nn.Embedding(num_embeddings=num_emb, embedding_dim=emb_dim)
        self.cosine = nn.CosineSimilarity()
        self.sigmoid = nn.Sigmoid()

    def forward(self, a, b):
        output1 = self.embedding1(a)
        output2 = self.embedding2(b)
        similarity = self.cosine(output1, output2)
        output = self.sigmoid(similarity)

        return output
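As a quick sanity check of the model above with dummy inputs (the sizes are arbitrary, and note that `nn.Embedding` expects integer indices):

model = Model(num_emb=1000, emb_dim=200)   # arbitrary sizes, just for the sanity check
a = torch.randint(0, 1000, (4,))           # batch of 4 indices for embedding1
b = torch.randint(0, 1000, (4,))           # batch of 4 indices for embedding2
print(model(a, b).shape)                   # torch.Size([4]): one score in (0, 1) per pair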

Training/Evaluation

model = Model(num_emb, emb_dim)

if torch.cuda.is_available():
    model = model.to('cuda')

model.train()

criterion = loss_function()
optimizer = some_optimizer()

for epoch in range(num_epochs):
    epoch_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()

        a, b, label = batch

        if torch.cuda.is_available():
            a = a.to('cuda')
            b = b.to('cuda')
            label = label.to('cuda')

        output = model(a, b)

        loss = criterion(output, label)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.cpu().item()

        print("Epoch %d \t Loss %.6f" % epoch, epoch_loss)

I omitted some details (e.g., hyperparameter values, loss function, optimizer, etc.). Is this overall procedure similar to what you're looking for, OP?
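For concreteness, one way to fill in those placeholders, going off the `nn.BCELoss` suggestion from my comments (the learning rate is just an arbitrary starting point, and the -1/1 labels are remapped to 0/1 as recommended above):

criterion = nn.BCELoss()                                    # pairs with the sigmoid output of Model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # arbitrary starting learning rate

# BCELoss expects targets in [0, 1], so remap the -1/1 labels inside the batch loop:
label = (label.float() + 1) / 2                             # -1 -> 0, 1 -> 1
loss = criterion(output, label)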

Sean
  • Thank you very much, I tried your code in my project and made appropriate modifications, but the model still has not converged. So I re-screened the embedding generation part, which will take some time. Looking forward to future good news, thank you again for helping! – island145287 Sep 08 '20 at 01:02
0

You could use the triplet loss function for training. Your input is a set of embeddings (say, 1000 rows), each encoded in, say, 200 dimensions, and you also have similarity labels. For example, row 1 could be similar to 20 of the 1000 rows and dissimilar to the remaining 980. You could then apply the triplet loss to row 1 by taking one positive and one negative match at a time, and do this for all 1000 rows in the training set. This way the embeddings are fine-tuned better. This is the training phase.
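Here is a minimal sketch of that idea using PyTorch's built-in `nn.TripletMarginLoss`, with a small trainable layer on top of the embeddings (the sizes and the random tensors are placeholders, not your actual data):

import torch
import torch.nn as nn

head = nn.Linear(200, 200)                       # the layer being fine-tuned on top of the embeddings
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

anchor = torch.randn(32, 200)     # a batch of anchor embeddings
positive = torch.randn(32, 200)   # embeddings labelled as similar to the anchors
negative = torch.randn(32, 200)   # embeddings labelled as dissimilar to the anchors

optimizer.zero_grad()
loss = triplet_loss(head(anchor), head(positive), head(negative))
loss.backward()
optimizer.step()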

Now for inference, you could just compute cosine similarity to determine which rows are close to each other and which are not (k-nearest with k = 1). I assume this is the goal of your model.
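A rough sketch of that lookup (again with placeholder tensors standing in for your fine-tuned embeddings):

import torch
import torch.nn.functional as F

query = torch.randn(200)        # embedding of the row you are querying with
bank = torch.randn(1000, 200)   # embeddings of the candidate rows

sims = F.cosine_similarity(query.expand_as(bank), bank, dim=1)  # shape: (1000,)
nearest = sims.argmax().item()  # index of the most similar row (k = 1)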

We assume here that the embeddings are 'transferable', in the sense that they come from something like BERT (text) or ImageNet (images) and can be fine-tuned by adding a layer on top.

Allohvk