How to eliminate duplicate strings from a list based on a similarity score calculated with fuzzywuzzy ratio?

Question

Let's say there are 4 lists:

1) ['12b, shanti vihar', '12b shanti bihar', '201 Anupam residency', '401 enclaves']
2) ['12b, shanti vihar', '12b shanti bihar', '12b shanti bihar', '401 enclaves']
3) ['12b, shanti vihar', '12b shanti vihar', '12b shanti bihar', '12b shanti bihar']
4) ['12b, shanti vihar', '12b rue de Paris road', '201 Anupam residency', '401 enclaves']

After passing these 4 lists to the fuzzymatch function, it should delete duplicate strings based on the fuzzy score (more than 90%) and return:

1) ['12b, shanti vihar', '201 Anupam residency', '401 enclaves']
2) ['12b, shanti vihar', '401 enclaves']
3) ['12b, shanti vihar']
4) ['12b, shanti vihar', '12b rue de Paris road', '201 Anupam residency', '401 enclaves']

To make it clearer: in case 1), '12b, shanti vihar' and '12b shanti bihar' are duplicates (the same address), so they get a high fuzzy similarity score (more than 90%), while the other entries score lower because they are different. I therefore need to keep only one of the two in the final output, which is ['12b, shanti vihar', '201 Anupam residency', '401 enclaves']. Similarly, in case 3) all the addresses are the same, so I need only one address in the final output: ['12b, shanti vihar'].

I tried to implement this, but I am not sure I am doing it the correct way:

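from fuzzywuzzy import fuzz, process

def fuzzydeduplicate(list_address):
    # For each address, score it against every address in the list.
    results = []
    for address in list_address:
        matches = process.extract(address, list_address, scorer=fuzz.token_set_ratio)
        results.append(matches)
    return results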

Calling this function gives me the following output:

[[('12b, shanti vihar', 100), ('12b, shanti bihar', 94), ('301 anupam residency', 28), ('13x', 11)], [('12b, shanti bihar', 100), ('12b, shanti vihar', 94), ('301 anupam residency', 28), ('13x', 11)], [('13x', 100), ('12b, shanti vihar', 11), ('12b, shanti bihar', 11), ('301 anupam residency', 9)], [('301 anupam residency', 100), ('12b, shanti vihar', 33), ('12b, shanti bihar', 28), ('13x', 9)]]

From here I want to eliminate duplicate strings based on the similarity score (more than 90%) and get the desired output.

Can anyone please help me implement this?

score 0 · Answer 1 · answered Feb 12 '21 at 09:03

from fuzzywuzzy import fuzz

elements = ['12b, shanti vihar', '12b, shanti bihar', '401 enclaves', '301 anupam residency']

# Collect the index of every element that scores >= 90 against an earlier element.
remove = set()
for i, element in enumerate(elements):
    for j, choice in enumerate(elements[i + 1:], start=i + 1):
        if fuzz.ratio(element, choice) >= 90:
            print('duplicate index = ' + str(j))
            remove.add(j)

# Rebuild the list by index, so only the later duplicates are dropped.
elements = [e for k, e in enumerate(elements) if k not in remove]
print(elements)
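
For reuse across several lists, the same idea can be wrapped in a function. The following is only a sketch under a few assumptions: the names fuzzy_dedupe and similarity are made up for illustration, and similarity is a standard-library stand-in built on difflib.SequenceMatcher (with lowercasing added), not fuzzywuzzy itself. If fuzzywuzzy is installed, passing scorer=fuzz.ratio or scorer=fuzz.token_set_ratio should work the same way, though the exact scores may differ. It also uses a slightly different strategy from the loop above: each string is compared only against the strings already kept, so the first member of each group of near-duplicates survives.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score, ignoring case (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def fuzzy_dedupe(addresses, threshold=90, scorer=similarity):
    """Keep the first of each group of near-duplicates, preserving order."""
    kept = []
    for candidate in addresses:
        # Keep the candidate only if it scores below the threshold
        # against every string that has already been kept.
        if all(scorer(candidate, seen) < threshold for seen in kept):
            kept.append(candidate)
    return kept


print(fuzzy_dedupe(['12b, shanti vihar', '12b shanti bihar',
                    '201 Anupam residency', '401 enclaves']))
```

As in the loop above, a score of exactly 90 counts as a duplicate here; use threshold=91 if only scores strictly above 90 should be removed.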
