
I am trying to calculate the semantic coherence of a given paragraph/transcript, i.e. detect whether somebody goes off track while talking about a topic (more specifically, while describing a picture, which might have many sub-details).

For example -

Transcript 1: I like to play sports. There are so many sports fans in the world.

Transcript 2: I like to play sports. There is a deadly virus spreading across the world.

Semantic coherence should be high for Transcript 1 and low for Transcript 2. I am using BERT (bert-as-service) to generate sentence embeddings. I then compare sentence i and sentence i+1 in a given transcript by calculating the cosine similarity between their embedding vectors. I have also tried a sliding window, with and without overlap, to calculate cosine similarity.
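
Roughly, my current pipeline looks like this (a simplified sketch, not my exact code; it assumes a bert-serving server is already running locally with a pre-trained BERT model):

from bert_serving.client import BertClient
import numpy as np

bc = BertClient()  # connects to the running bert-serving server

sentences = [
    'I like to play sports.',
    'There is a deadly virus spreading across the world.'
]

# one embedding vector per sentence
embeddings = bc.encode(sentences)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine similarity between consecutive sentences
for i in range(len(sentences) - 1):
    print(cosine(embeddings[i], embeddings[i + 1]))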

The problem I am running into is that the cosine similarities are very close in both cases, including for the two transcripts above, whereas I would expect a greater difference between them.

I am thinking of trying an LSA model trained on Wikipedia data next, to see whether it gives better differentiation. Is there a better method of doing this?
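
For the LSA baseline, this is roughly what I have in mind (a sketch using gensim's LsiModel on a toy corpus just to show the mechanics; in practice I would train it on a Wikipedia dump):

from gensim import corpora, models
from gensim.matutils import cossim
from gensim.utils import simple_preprocess

# tiny toy corpus standing in for Wikipedia text
corpus_texts = [
    'sports fans enjoy watching and playing games',
    'a virus can spread quickly and cause disease',
    'people around the world play many different sports',
]
tokenized = [simple_preprocess(doc) for doc in corpus_texts]

dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# LSA is called LSI in gensim
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)

def lsi_vec(sentence):
    return lsi[dictionary.doc2bow(simple_preprocess(sentence))]

# cosine similarity in LSA space
print(cossim(lsi_vec('I like to play sports'),
             lsi_vec('There are so many sports fans in the world.')))
print(cossim(lsi_vec('I like to play sports'),
             lsi_vec('There is a deadly virus spreading across the world.')))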

Samarth
  • Err, your example “sentences” actually both consist of two sentences. Do you mean “paragraph”? – DisappointedByUnaccountableMod Mar 03 '20 at 22:23
  • It has two sentences because I want to calculate the semantic coherence between sentence1[i] and sentence1[i+1] in a paragraph/transcript. It could also just be a window of 5 tokens rather than the full sentence. I am not trying to calculate coherence between sentence 1 and sentence 2. I made some edits to OP, hope that helps. – Samarth Mar 03 '20 at 22:28
  • do you have labeled training data? (and how much?) – chefhose Mar 04 '20 at 12:18
  • I do NOT have any labels for the semantic similarity of the transcripts. I am trying to do it in an unsupervised fashion by using a pre-trained model, be it BERT or LSA. I understand that this is one of the reasons for the poor performance, as there is no fine-tuning involved. I am just trying to explore my options for calculating a similarity measure without any custom data. – Samarth Mar 04 '20 at 16:26

1 Answer


Using the SentenceTransformers library, the following code solves the problem in very few lines:

from sentence_transformers import SentenceTransformer, util

# small, fast pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    'I like to play sports',
    'There are so many sports fans in the world.',
    'There is a deadly virus spreading across the world.'
]

# one embedding vector per sentence
embeddings = model.encode(sentences)

print(f'Similarity between "{sentences[0]}" and:')
for i in [1, 2]:
    # cosine similarity between the first sentence and sentence i
    similarity = util.cos_sim(embeddings[0], embeddings[i])[0][0]
    print(f'- {sentences[i]} \t=> {similarity}')

This outputs:

Similarity between "I like to play sports" and:
- There are so many sports fans in the world.         => 0.5119736790657043
- There is a deadly virus spreading across the world. => 0.0683723688125610
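
If you want a single coherence score per transcript rather than pairwise numbers, one simple option (not the only one) is to average the cosine similarities of consecutive sentences, continuing with the model and util imported above:

def coherence(transcript_sentences):
    # mean cosine similarity of consecutive sentence pairs
    embs = model.encode(transcript_sentences)
    sims = [util.cos_sim(embs[i], embs[i + 1]).item()
            for i in range(len(embs) - 1)]
    return sum(sims) / len(sims)

transcript_1 = ['I like to play sports.',
                'There are so many sports fans in the world.']
transcript_2 = ['I like to play sports.',
                'There is a deadly virus spreading across the world.']

print(coherence(transcript_1))  # higher score: stays on topic
print(coherence(transcript_2))  # lower score: drifts to an unrelated topic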
Carlos Souza