
My friends and I are doing an NLP project on song recommendation.

Context: We originally planned to have the model recommend a playlist of songs whose lyrics are most similar to an arbitrary input corpus (from literature, etc.); however, we didn't really have a concrete idea of how to implement it.

Currently, our task is to find lyrics similar to an arbitrary lyric fed in as a string. We are using a Sentence-BERT (SBERT) model and cosine similarity to measure similarity between songs, and the resulting scores seem meaningful enough to identify the most similar song lyrics.
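
Roughly, our current code looks like this (the SBERT checkpoint name below is only an example of the kind of model we load):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# an example SBERT checkpoint; the exact model we load may differ
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

lyrics1 = "lyrics of the input song ..."
lyrics2 = "lyrics of a candidate song ..."

# encode both lyrics into dense sentence vectors
sentence_embeddings = sbert_model.encode([lyrics1, lyrics2])

# pairwise cosine similarity; cos_sim[0][1] compares the two lyrics
cos_sim = cosine_similarity(sentence_embeddings)
print(cos_sim[0][1])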

Is there any other way that we can improve this approach?

We'd like to keep using a BERT model and are open to suggestions that can be applied on top of BERT if possible, but if there are other models that should be used instead of BERT, we'd be happy to learn about them. Thanks.


1 Answer


Computing cosine similarity

You can use the util.cos_sim(embeddings1, embeddings2) function from the sentence-transformers package to compute the cosine similarity of two embeddings.

Alternatively, you can use sklearn.metrics.pairwise.cosine_similarity(X, Y, dense_output=True) from the scikit-learn package.
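
For example, here is a minimal sketch with two freshly computed SBERT embeddings (the checkpoint name is only illustrative):

from sentence_transformers import SentenceTransformer, util
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')  # illustrative checkpoint
embeddings1 = model.encode(["lyrics of the first song"])
embeddings2 = model.encode(["lyrics of the second song"])

# sentence-transformers helper: returns a torch tensor of shape (1, 1)
print(util.cos_sim(embeddings1, embeddings2))

# scikit-learn equivalent: returns a numpy array of shape (1, 1)
print(cosine_similarity(embeddings1, embeddings2, dense_output=True))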

Improvements for representation and models

Since you want recommendations built on top of BERT, you can also consider RoBERTa, which uses a byte-pair-encoding (BPE) tokenizer instead of BERT's WordPiece tokenizer. Consider the roberta-base model as a feature extractor from the Hugging Face transformers package.

from transformers import RobertaTokenizer, RobertaModel

# load the pretrained RoBERTa tokenizer (byte-level BPE) and model
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

# encode one lyric and compute its contextual token embeddings
text = "song lyrics in text."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
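
To compare two lyrics with this feature extractor, run each string through the model separately, pull an embedding out of output, and then apply cosine similarity. Here is a minimal sketch, using the first token of last_hidden_state as the sentence embedding (as in the comments below; mean pooling over all tokens is another common choice):

import torch
from transformers import RobertaTokenizer, RobertaModel
from sentence_transformers import util

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

def embed(text):
    # encode one lyric and take the first token's hidden state as its embedding
    encoded_input = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        output = model(**encoded_input)
    return output['last_hidden_state'][0][0]

lyric1_emb = embed("this is the first song")
lyric2_emb = embed("this is the second song")
print(util.cos_sim(lyric1_emb, lyric2_emb))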

Tokenizers work at various levels of textual granularity, affecting both syntax and semantics, and they help generate quality vectors/embeddings. Each can yield different, and sometimes better, results when fine-tuned for the right task and model.

Some other tokenizers you can consider are: character-level BPE, byte-level BPE, WordPiece (which BERT uses), SentencePiece, and the Unigram language-model tokenizer. A quick illustration follows below.
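
To see how tokenizer granularity differs, you can compare how BERT's WordPiece and RoBERTa's byte-level BPE split the same text (the exact tokens depend on each model's vocabulary):

from transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained('bert-base-uncased')
roberta_tok = RobertaTokenizer.from_pretrained('roberta-base')

text = "song lyrics in text."

# WordPiece marks non-initial subwords with a '##' prefix
print(bert_tok.tokenize(text))

# byte-level BPE marks word boundaries with a leading 'Ġ'
print(roberta_tok.tokenize(text))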

Also consider exploring the official Hugging Face Tokenizers library documentation.

Nikhil Wani
  • Right now we are doing `sentence_embeddings = sbert_model.encode(lyrics1, lyrics2)` and `cos_sim = cosine_similarity(sentence_embeddings)`, so how should we use that code on top of what we have? – yyy818 May 06 '23 at 16:47
  • Instead of .encode(), use the embeddings from the output variable of the RoBERTa code above, then pass them to your choice of cosine similarity. `output['last_hidden_state'][0][0]` should give you the features/embeddings. The code above is an example for a single sentence; you can scale it to as many sentences/lyrics as you want. Does that make sense? – Nikhil Wani May 06 '23 at 16:51
  • So should I pass in output['last_hidden_state'][0][0] for lyrics1 and 2 to cosine_similarity(sentence_embeddings) instead of sentence_embeddings? Did I understand it correctly? – yyy818 May 06 '23 at 17:45
  • `lyrics1 = "this is the first song"` `lyrics2 = "this is the second song"` # RoBERTa embeddings (run each lyric through the model separately) - `lyric1_emb = output1['last_hidden_state'][0][0]` `lyric2_emb = output2['last_hidden_state'][0][0]` Now that you have the RoBERTa embeddings of both lyrics, you can compute the cosine similarity: `util.cos_sim(lyric1_emb, lyric2_emb)`. This assumes you are passing the strings individually to the transformer model. Does this answer your question? – Nikhil Wani May 06 '23 at 18:21
  • Yes! Thank you for the help! Just one more question, though: I ran it and got very high cosine scores, like 0.996 and 0.995, when comparing one song with two different songs. Is that what I should expect when I run this model? – yyy818 May 06 '23 at 18:42
  • Great! Yes, that means those two lyrics are similar. The original song would be more similar to the song with cosine 0.996 than to the one with 0.995. If this answers the question, you can accept the answer above. Thanks. – Nikhil Wani May 06 '23 at 18:45