Suppose I have a list of `Person` objects:

class Person:
  def __init__(self, name, items):
    self.name = name
    self.items = items

I want to remove duplicates in the following way. If two persons have similar enough names, as evaluated by this function:

def have_similar_names(person1, person2):
    ...

(assume this function is already written and uses the edit distance from the fuzzywuzzy package; for example, it would return True for persons named "Tomas" and "Tomàs" and False for "Catherine" and "Cathleen"), then combine the two persons using:

def combine_persons(person1, person2):
    return Person(max([person1.name, person2.name], key=len), person1.items+person2.items)

My question is how to write a function that takes the list containing fuzzy duplicates as input and returns a list of combined persons.

I could do it with loops, but I wonder if there is a more efficient and pythonic way to achieve this?
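For reference, here is a minimal sketch of what the loop-based version might look like. It is only an illustration under stated assumptions: `merge_fuzzy_duplicates` is a hypothetical name, and `have_similar_names` is replaced by a stand-in built on the standard library's `difflib.SequenceMatcher` with an arbitrary 0.75 threshold, not the fuzzywuzzy-based function described above.

```python
from difflib import SequenceMatcher


class Person:
    def __init__(self, name, items):
        self.name = name
        self.items = items


def have_similar_names(person1, person2, threshold=0.75):
    # Stand-in for the real function (assumption): uses difflib from the
    # standard library instead of fuzzywuzzy; the threshold is arbitrary.
    ratio = SequenceMatcher(None, person1.name.lower(), person2.name.lower()).ratio()
    return ratio >= threshold


def combine_persons(person1, person2):
    # Same rule as in the question: keep the longer name, concatenate items.
    return Person(max([person1.name, person2.name], key=len),
                  person1.items + person2.items)


def merge_fuzzy_duplicates(persons):
    # Hypothetical helper: greedy single pass over the input.
    merged = []
    for person in persons:
        for i, kept in enumerate(merged):
            if have_similar_names(kept, person):
                # Fold the newcomer into the first similar person found.
                merged[i] = combine_persons(kept, person)
                break
        else:
            # No similar person so far: keep this one as a new entry.
            merged.append(person)
    return merged
```

Each person is compared only against the persons kept so far, so the worst case is still quadratic in the list length. If the similarity function is not transitive, the result may depend on the order of the input list.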

Undead
  • post a set of "similar" persons – RomanPerekhrest Dec 13 '19 at 17:35
  • That doesn't sound like a good idea unless `have_similar_names` is transitive (which is unlikely). – Stefan Pochmann Dec 13 '19 at 17:35
  • It really depends on your definition of "similar". [Fuzzy String Matching in Python](https://www.datacamp.com/community/tutorials/fuzzy-string-python) might give you some ideas. – ytu Dec 13 '19 at 17:39
  • I updated the question. It is NOT about the fuzzy comparison itself but about how to clean the duplicates out of the list. Also, this is just a dummy example, don't take it too seriously. – Undead Dec 13 '19 at 17:44
  • I guess the first step for you is to define what a fuzzy similarity is. What's your similarity metric? Hamming distance? Edit distance? Or some kind of pseudo-Unicode distance? – Srini Dec 13 '19 at 17:54
  • The similarity function is not the problem here. Suppose that it is already coded and works properly. What I'm asking is how to process the data without having to loop through everything twice. – Undead Dec 13 '19 at 18:06
  • Just to confirm, the `Person` objects are in plain Python lists? Sorry if everyone seems a bit confused as to what the issue actually is, we’re probably all a bit tired ;) Is `have_similar_names()` a static method of the `Person` class? What about `combine_persons()`? They could be both static methods **and** instance methods, actually. – AMC Dec 13 '19 at 18:48
  • It might be useful to know how the fuzzy matching is being done, since any solution will have to rely on certain properties of the matching algorithm to avoid destroying performance. You mentioned this is a dummy example; I hope you can share at least part of the real thing. – AMC Dec 13 '19 at 18:55
  • @Alexander Cécile No problem, the question probably wasn't clear enough! Let's say they are static methods or even plain functions. – Undead Dec 13 '19 at 18:57
  • The matching could be done using the edit distance from the fuzzywuzzy package for example. I added that in the question. – Undead Dec 13 '19 at 19:03
  • @Undead Wonderful (and adorable), I’ll take a look. It seems it uses a few different libraries, lucky me, I get to learn all about those, too ;) – AMC Dec 13 '19 at 19:13
  • @Undead Do you have any test/example data? – AMC Dec 15 '19 at 01:27

0 Answers