
I have a list of strings, my_list = ['policeman', 'police officers', 'police force', ...], with around 2000 entries. I want to group these words by cosine similarity: if the cosine similarity between two words is above 0.7, they should go into the same group. A word that is already in a group should not appear in any other group. Here is my code:

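# Assumed setup (not shown in the original post): nlp is a loaded spaCy pipeline,
# and cosine_similarity is imported from scikit-learn.
from sklearn.metrics.pairwise import cosine_similarity

def subject_similarity_grouped(subj_list, threshold):
    # precompute one embedding vector per word
    embeddings = {word: nlp(word).vector for word in subj_list}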
    
    # cosine similarity and grouping
    # create a list to hold the groups
    groups = []

    # iterate over each word in the list
    for word in subj_list:
        # check if the word is already in a group
        in_a_group = False
        for group in groups:
            if word in group:
                in_a_group = True
                break

        # if the word is not in a group, create a new group for it
        if not in_a_group:
            # create a new group with the current word
            new_group = [word]

            # retrieve the embedding for the current word
            embedding1 = embeddings[word]

            # iterate over the remaining words and add them to the current group if they are similar enough
            for other_word in subj_list:
                # skip the current word
                if other_word == word:
                    continue

                # check if the other word is already in a group
                in_a_group = False
                for group in groups:
                    if other_word in group:
                        in_a_group = True
                        break

                # if the other word is not in a group, retrieve its embedding and compute its similarity to the current word
                if not in_a_group:
                    embedding2 = embeddings[other_word]
                    similarity = cosine_similarity(embedding1.reshape(1, -1), embedding2.reshape(1, -1))[0][0]
                    # if the similarity is above the threshold, add the word to the current group
                    if similarity > threshold:
                        new_group.append(other_word)

            # add the new group to the list of groups
            groups.append(new_group)
            
    # drop groups that contain only one word
    groups = [lst for lst in groups if len(lst) > 1]
    return groups

This is my function, but somehow 'policeman' and 'police officers' are not grouped together.

The expected output is [['policeman', 'police officers', 'police force'], ['apple', 'banana', 'pineapple']].
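To make the intended behaviour concrete, here is a minimal, dependency-free sketch of the same seed-based grouping. The helper names `cosine` and `greedy_groups` and the 2D vectors in `toy` are made up for illustration; they are not real embeddings and not the output of `nlp`. Note that a word only joins a group if it is similar enough to the group's first word (the seed), so with real embeddings the result depends on how similar 'police officers' is to 'policeman' itself and on the order of the list.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)

def greedy_groups(embeddings, threshold):
    """Each ungrouped word seeds a group and pulls in every still-ungrouped
    word whose similarity to the seed is above the threshold."""
    assigned = set()
    groups = []
    for seed, seed_vec in embeddings.items():
        if seed in assigned:
            continue
        group = [seed]
        assigned.add(seed)
        for other, other_vec in embeddings.items():
            if other in assigned:
                continue
            if cosine(seed_vec, other_vec) > threshold:
                group.append(other)
                assigned.add(other)
        groups.append(group)
    # keep only groups with more than one word
    return [g for g in groups if len(g) > 1]

# Hypothetical 2D vectors, for illustration only.
toy = {
    "policeman": [1.0, 0.0],
    "police officers": [0.9, 0.3],
    "police force": [0.8, 0.5],
    "apple": [0.0, 1.0],
    "banana": [0.1, 0.9],
    "pineapple": [0.2, 1.0],
}
groups = greedy_groups(toy, 0.7)
print(groups)
```

With these toy vectors the two expected groups come out. If the real data does not behave this way, printing the similarity between the two words in question is a quick way to check whether it is actually above the threshold.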

Manfred
Fio
  • I didn't look at your code, but wonder whether your problem even has a well-defined solution. Imagine _n_ 2D vectors evenly distributed on a unit circle. How should your grouping be? Once you start with one vector, the next one will probably (i.e. for sufficiently large _n_) be similar enough. While the pair of adjacent ones will always be likewise similar, that's not necessarily true for the entire group consisting of all _n_ vectors. – Manfred Apr 15 '23 at 11:22
  • What is the output you are currently getting? – Grewal_Creator Apr 16 '23 at 23:55
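
The unit-circle point from the first comment can be checked numerically. This is only a sketch under assumed values: `n = 12` and the 0.7 threshold are chosen for illustration. Every pair of adjacent vectors is above the threshold, yet two vectors on opposite sides of the circle are as dissimilar as possible, so "similar to a neighbour" does not carry over to the whole chain.

```python
import math

n = 12           # assumed number of vectors, for illustration
threshold = 0.7  # same threshold as in the question

# n unit vectors evenly spaced on the unit circle
vectors = [
    (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
    for k in range(n)
]

def cosine(u, v):
    # both vectors have length 1, so the dot product is the cosine similarity
    return u[0] * v[0] + u[1] * v[1]

# similarity of each vector to its next neighbour (wrapping around)
adjacent = [cosine(vectors[k], vectors[(k + 1) % n]) for k in range(n)]
# similarity of two vectors on opposite sides of the circle
opposite = cosine(vectors[0], vectors[n // 2])

print(min(adjacent) > threshold)  # every adjacent pair passes the threshold
print(opposite > threshold)       # the opposite pair does not
```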

0 Answers