I have a list of strings: my_list = ['policeman', 'police officers', 'police force',..]
The list has around 2000 entries. I want to group these strings by cosine similarity: if the cosine similarity between two strings is above 0.7, I want them in the same group. A string that is already in a group should not appear in any other group. Here is my code:
from sklearn.metrics.pairwise import cosine_similarity

# nlp is assumed to be an already loaded spaCy pipeline that provides word vectors

def subject_similarity_grouped(subj_list, threshold):
    # compute the embedding of every word once, up front
    embeddings = {word: nlp(word).vector for word in subj_list}
    # cosine similarity and grouping
    # create a list to hold the groups
    groups = []
    # iterate over each word in the list
    for word in subj_list:
        # check if the word is already in a group
        in_a_group = False
        for group in groups:
            if word in group:
                in_a_group = True
                break
        # if the word is not in a group, create a new group for it
        if not in_a_group:
            # create a new group with the current word
            new_group = [word]
            # retrieve the embedding for the current word
            embedding1 = embeddings[word]
            # iterate over all other words and add them to the current group if they are similar enough
            for other_word in subj_list:
                # skip the current word
                if other_word == word:
                    continue
                # check if the other word is already in a group
                other_in_a_group = False
                for group in groups:
                    if other_word in group:
                        other_in_a_group = True
                        break
                # if the other word is not in a group, retrieve its embedding and compute its similarity to the current word
                if not other_in_a_group:
                    embedding2 = embeddings[other_word]
                    similarity = cosine_similarity(embedding1.reshape(1, -1), embedding2.reshape(1, -1))[0][0]
                    # if the similarity is above the threshold, add the word to the current group
                    if similarity > threshold:
                        new_group.append(other_word)
            # add the new group to the list of groups
            groups.append(new_group)
    # drop groups that contain only a single entry
    groups = [lst for lst in groups if len(lst) > 1]
    return groups
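For reference, the greedy pass in the function above can be sketched more compactly by tracking already grouped words in a set. This is only a sketch under assumptions: it uses a plain-Python cosine helper and tiny made-up 2-D vectors instead of nlp(word).vector, and the names cosine and greedy_groups are hypothetical. For a list without duplicates it should produce the same kind of grouping.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def greedy_groups(embeddings, threshold):
    """Each unassigned word seeds a group and pulls in every still-unassigned
    word whose similarity to the seed (not to other members) exceeds threshold."""
    assigned = set()
    groups = []
    words = list(embeddings)
    for seed in words:
        if seed in assigned:
            continue
        group = [seed]
        assigned.add(seed)
        for other in words:
            if other in assigned:
                continue
            if cosine(embeddings[seed], embeddings[other]) > threshold:
                group.append(other)
                assigned.add(other)
        groups.append(group)
    # drop groups that contain only a single entry
    return [g for g in groups if len(g) > 1]

# Made-up 2-D vectors for illustration only; real ones would come from nlp(word).vector
toy = {
    "policeman": [1.0, 0.1],
    "police officers": [0.9, 0.2],
    "apple": [0.1, 1.0],
    "banana": [0.2, 0.9],
    "table": [-1.0, -1.0],
}
result = greedy_groups(toy, 0.7)
print(result)
```

Note that in this scheme a word is only ever compared to the seed of a group, and a word claimed by an earlier seed is no longer available to later ones, so the result depends on the order of the input list.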
This is my function, subject_similarity_grouped, but somehow 'policeman' and 'police officers' are not grouped together.
The expected output is [['policeman', 'police officers', 'police force'], ['apple', 'banana', 'pineapple']]
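To narrow this down, a small check like the one below might help. It reports the similarity of one pair and which group seed, if any, claimed each word. It is a sketch only: explain_pair and cosine are hypothetical helper names, the vectors are made up, and the groups value is hard-coded to what a greedy pass would return for the order ['officers', 'police officers', 'policeman'] at a threshold of 0.7. I am not sure this is what happens with my real data.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def explain_pair(embeddings, groups, word_a, word_b):
    """Return the pair's similarity and the seed (first element) of the group
    each word landed in, or None if the word is in no returned group."""
    def seed_of(word):
        for group in groups:
            if word in group:
                return group[0]
        return None
    return {
        "similarity": cosine(embeddings[word_a], embeddings[word_b]),
        "seed_a": seed_of(word_a),
        "seed_b": seed_of(word_b),
    }

# Made-up unit vectors for illustration only
toy = {
    "officers": [0.6, 0.8],
    "police officers": [0.8, 0.6],
    "policeman": [1.0, 0.0],
}
# 'officers' comes first, so it seeds a group and claims 'police officers';
# 'policeman' is then left alone and its single-entry group is dropped
groups = [["officers", "police officers"]]

report = explain_pair(toy, groups, "policeman", "police officers")
print(report)
```

In this toy case the pair's similarity is above 0.7, yet the two words still end up apart because 'police officers' was claimed by an earlier seed. If the similarity reported for the real embeddings is below the threshold instead, the problem would be in the vectors rather than in the grouping loop.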