Calculating semantic coherence in a given speech transcript

Question

I am trying to calculate the semantic coherence of a given paragraph or transcript, i.e., to detect whether somebody goes off track while talking about a topic. More specifically, the speaker is describing a picture (which might have many sub-details).

For example:

Transcript 1: I like to play sports. There are so many sports fans in the world.

Transcript 2: I like to play sports. There is a deadly virus spreading across the world.

Semantic coherence should be high for Transcript 1 and low for Transcript 2. I am using BERT (bert-as-service) to generate embeddings for the sentences. I then compare sentence i and sentence i+1 in a given transcript by calculating the cosine similarity between their embedding vectors. I have also tried using a sliding window, with and without overlap, to calculate cosine similarity.

The problem I am running into is that the cosine similarities come out very close for both sentence pairs (for example, the two transcripts above), whereas I would expect a much greater difference between them.

I am thinking of trying an LSA model trained on Wikipedia data next, to see whether it gives better differentiation. Is there a better method of doing this?

Err, your example “sentences” actually both consist of two sentences. Do you mean “paragraph”? — DisappointedByUnaccountableMod, Mar 03 '20 at 22:23
It has two sentences because I want to calculate the semantic coherence between sentence1[i] and sentence1[i+1] in a paragraph/transcript. It could also just be a window of 5 tokens rather than the full sentence. I am not trying to calculate coherence between sentence 1 and sentence 2. I made some edits to OP, hope that helps. — Samarth, Mar 03 '20 at 22:28
I do NOT have any labels for the semantic similarity of the transcripts. I am trying to do it in an unsupervised fashion by using a pre-trained model - be it BERT or LSA. I understand that it is one of the reasons for such poor performance as there is no fine tuning involved. Just trying to explore my options to calculate a similarity measure without any custom data. — Samarth, Mar 04 '20 at 16:26

score 0 · Answer 1 · answered Jan 22 '23 at 01:19

Expanding on SentenceTransformers, the following code solves the problem in very few lines:

from sentence_transformers import SentenceTransformer, util

# Pre-trained general-purpose sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    'I like to play sports',
    'There are so many sports fans in the world.',
    'There is a deadly virus spreading across the world.'
]

# One embedding vector per sentence
embeddings = model.encode(sentences)

# Compare the first sentence with each candidate follow-up sentence
print(f'Similarity between "{sentences[0]}" and:')
for i in [1, 2]:
    # cos_sim returns a 1x1 tensor here; [0][0] extracts the single score
    similarity = util.cos_sim(embeddings[0], embeddings[i])[0][0]
    print(f'- {sentences[i]} \t=> {similarity}')

This outputs:

Similarity between "I like to play sports" and:
- There are so many sports fans in the world.         => 0.5119736790657043
- There is a deadly virus spreading across the world. => 0.0683723688125610
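
To score a whole transcript rather than a single pair, the same idea can be applied to each pair of adjacent sentences (sentence i and sentence i+1), as the question describes. The following is only a rough sketch: the helper names `cosine`, `adjacent_coherence` and `coherence_score` are made up for illustration, averaging the adjacent similarities is just one possible way to aggregate them, and the toy 2-d vectors stand in for real sentence embeddings. It is written in plain Python so it runs without any extra dependencies.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)


def adjacent_coherence(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of each sentence embedding to the one that follows it."""
    return [cosine(embeddings[i], embeddings[i + 1])
            for i in range(len(embeddings) - 1)]


def coherence_score(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean adjacent similarity; 1.0 by convention for fewer than two sentences."""
    sims = adjacent_coherence(embeddings)
    return sum(sims) / len(sims) if sims else 1.0


# Toy 2-d vectors standing in for real sentence embeddings:
# the first two "sentences" point the same way, the third is orthogonal.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(adjacent_coherence(toy_embeddings))  # [1.0, 0.0]
print(coherence_score(toy_embeddings))     # 0.5
```

With real data, the rows of `model.encode(sentences)` from the code above could be passed in place of `toy_embeddings`. A sharp drop in the per-pair list would then point to the place where the speaker goes off topic, which may be more informative than the single averaged score.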
