Let's say there are 4 lists:
1) [12b, shanti vihar, 12b shanti bihar, 201 Anupam residency, 401 enclaves]
2) [12b, shanti vihar, 12b shanti bihar, 12b shanti bihar, 401 enclaves]
3) [12b, shanti vihar, 12b shanti vihar, 12b shanti bihar, 12b shanti bihar]
4) [12b, shanti vihar, 12b rue de Paris road, 201 Anupam residency, 401 enclaves]
After passing each of these 4 lists to the fuzzy-match function, it should remove duplicate strings based on the fuzzy score (more than 90%) and return:
1) [12b, shanti vihar, 201 Anupam residency, 401 enclaves]
2) [12b, shanti vihar, 401 enclaves]
3) [12b, shanti vihar]
4) [12b, shanti vihar, 12b rue de Paris road, 201 Anupam residency, 401 enclaves]
To make it clearer: in case 1) [12b, shanti vihar, 12b shanti bihar, 201 Anupam residency, 401 enclaves], "12b, shanti vihar" and "12b shanti bihar" are duplicates (the same address), so they have a high fuzzy similarity score (more than 90%), while the other pairs score lower because they are different addresses. I therefore need to keep only one of the two in the final output, which is [12b, shanti vihar, 201 Anupam residency, 401 enclaves]. Similarly, in case 3) all the addresses are the same, so I need only one address in the final output: [12b, shanti vihar].
I tried to implement this, but I am not sure whether I am doing it correctly:
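# Assumes the fuzzywuzzy package, which provides fuzz and process.
from fuzzywuzzy import fuzz, process

def fuzzydeduplicate(list_address):
    # Score every address against the whole list (each address also matches itself).
    results = []
    for address in list_address:
        matches = process.extract(address, list_address, scorer=fuzz.token_set_ratio)
        results.append(matches)
    return results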
After calling this function I am getting output as:
[[('12b, shanti vihar', 100), ('12b, shanti bihar', 94), ('301 anupam residency', 28), ('13x', 11)], [('12b, shanti bihar', 100), ('12b, shanti vihar', 94), ('301 anupam residency', 28), ('13x', 11)], [('13x', 100), ('12b, shanti vihar', 11), ('12b, shanti bihar', 11), ('301 anupam residency', 9)], [('301 anupam residency', 100), ('12b, shanti vihar', 33), ('12b, shanti bihar', 28), ('13x', 9)]]
From here I want to eliminate duplicate strings based on the similarity score (more than 90%) and get the desired output.
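One way to get from that score table to a deduplicated list is sketched below. It is only a rough sketch under two assumptions: the first pair in each row is the query matched against itself (as in the output shown), and every address appears as a candidate in every row. The second assumption may not hold for longer lists, because `process.extract` returns only the top 5 matches unless `limit` is raised. The function name `dedupe_from_extract` is made up for this example.

```python
def dedupe_from_extract(results, threshold=90):
    """Collapse near-duplicates using rows of (candidate, score) pairs.

    Assumes row[0] is the query matched against itself, as in the
    output shown in the question.
    """
    kept = []
    dropped = set()
    for row in results:
        query = row[0][0]
        if query in dropped:
            # Already covered by an earlier, similar address.
            continue
        kept.append(query)
        for candidate, score in row[1:]:
            if score > threshold:
                dropped.add(candidate)
    return kept


results = [
    [('12b, shanti vihar', 100), ('12b, shanti bihar', 94), ('301 anupam residency', 28), ('13x', 11)],
    [('12b, shanti bihar', 100), ('12b, shanti vihar', 94), ('301 anupam residency', 28), ('13x', 11)],
    [('13x', 100), ('12b, shanti vihar', 11), ('12b, shanti bihar', 11), ('301 anupam residency', 9)],
    [('301 anupam residency', 100), ('12b, shanti vihar', 33), ('12b, shanti bihar', 28), ('13x', 9)],
]
deduped = dedupe_from_extract(results)
print(deduped)
```

The first address in each group of similar addresses is the one that survives, so the result depends on the input order.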
Can anyone please help me implement this?
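A minimal sketch of the rule described above: walk the list once and keep an address only if it scores at or below the threshold against everything already kept. To stay self-contained it uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer, so the scores are not the same numbers `fuzz.token_set_ratio` would give. The names `normalize`, `difflib_score` and `dedupe_addresses` are hypothetical, and the normalization step (lowercasing, dropping punctuation) is an assumption about what should count as the same address.

```python
import re
from difflib import SequenceMatcher


def normalize(address):
    # Lowercase and drop punctuation so "12b, shanti vihar" and
    # "12b shanti vihar" compare as equal.
    return re.sub(r"[^a-z0-9 ]", "", address.lower()).strip()


def difflib_score(a, b):
    # Stand-in scorer on a 0-100 scale (not fuzz.token_set_ratio).
    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dedupe_addresses(addresses, threshold=90, scorer=difflib_score):
    # Greedy pass: keep an address only if it is not too similar
    # to any address that was kept before it.
    kept = []
    for address in addresses:
        if all(scorer(address, other) <= threshold for other in kept):
            kept.append(address)
    return kept


case_1 = ["12b, shanti vihar", "12b shanti bihar", "201 Anupam residency", "401 enclaves"]
case_3 = ["12b, shanti vihar", "12b shanti vihar", "12b shanti bihar", "12b shanti bihar"]
print(dedupe_addresses(case_1))
print(dedupe_addresses(case_3))
```

If fuzzywuzzy is installed, passing `scorer=fuzz.token_set_ratio` should give the scoring used in the question, since that function also takes two strings and returns a score from 0 to 100. The first occurrence in each group of similar addresses is the one kept, so the output order follows the input order.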