I have a Python list. I need to compare every item in it against the others and replace the shorter strings with the longest ones.

EDIT: I have a list of people's names that I get using the spaCy module and its entity extraction. Sometimes an entry is the full name, sometimes only part of the name. I want to normalize this list so that every entry is the full name (or the longest form of the name that appears in the article). This will help me determine who the most prominent (most mentioned) person in the article is.

small_example = ['David', 'David Stevens', 'Steve Martin']
small_example_outcome = ['David Stevens', 'David Stevens', 'Steve Martin']

Full Example:

person_list = ['Omarosa Manigault Newman', 'Manigault Newman', 'Trump', 'Apprentice', 'Mark Burnett', 'Manigault Newman', 'TAPES', 'Omarosa', 'Donald J. Trump', 'Omarosa', 'Donald J. Trump', 'Jacques Derrida', 'Derrida', 'Sigmund Freud', 'Mark Burnett', 'Manigault Newman', 'Manigault Newman', 'Trump', 'Mark Burnett']

Ideally what I'd have in the end is:       
corrected_list = ['Omarosa Manigault Newman', 'Omarosa Manigault Newman', 'Donald J. Trump', 'Apprentice', 'Mark Burnett', 'Omarosa Manigault Newman', 'TAPES', 'Omarosa Manigault Newman', 'Donald J. Trump', 'Omarosa Manigault Newman', 'Donald J. Trump', 'Jacques Derrida', 'Jacques Derrida', 'Sigmund Freud', 'Mark Burnett', 'Omarosa Manigault Newman', 'Omarosa Manigault Newman', 'Donald J. Trump', 'Mark Burnett']

But a list like this would work too:

normalized_list = ['Omarosa Manigault Newman', 'Apprentice', 'Mark Burnett', 'TAPES', 'Jacques Derrida', 'Donald J. Trump', 'Sigmund Freud']
  • What exactly does "match" mean? Does it mean that the string is an exact substring of another string in the list? – abarnert Aug 17 '18 at 01:29
  • I have a list of people's names from entity extraction. I get back a list where sometimes it's the first & last name, sometimes it's just the first, and other times it's just the last. I want to normalize this list so it's always "First & Last name" – dustin williams Aug 17 '18 at 01:33
  • Also, your example is invalid. I think it's just missing a `'` somewhere. – abarnert Aug 17 '18 at 01:33
  • First, put your clarification into the question. Second, your clarification doesn't give the output you say you want. `Omarosa Manigault Newman` is not "First & Last name", it's got a middle name as well. And this isn't just nitpicking—I have no idea whether, say, `Omarosa Newman` should match `Omarosa Manigault Newman` even though it's not a substring the way `Omarosa` should. (Naively, I wouldn't expect `John Adams` and `John Quincy Adams` to be matched, but I would expect `William Clinton` and `William Jefferson Clinton`, and I can't think of a rule that handles that without serious AI…) – abarnert Aug 17 '18 at 01:35
  • Thank you for this clarifying response. I wasn't paying attention when I posted the example that it's a "Full" name not just a "First & Last" name. – dustin williams Aug 17 '18 at 02:10

1 Answer

I think what you're looking for is whether each string is a substring of another string in the list?

If the list is pretty short, like this one, we can do that with a stupid quadratic search:

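corrected_list = []
for person in person_list:
    # Every entry that contains this name as a substring (including itself)
    matches = (other for other in person_list if person in other)
    # Keep the longest one; on a tie, max() returns the first it sees
    longest = max(matches, key=len)
    corrected_list.append(longest)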
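
As a quick sanity check, here is the same loop wrapped in a function and run on the small example from the question. The function name `expand_to_longest` is just a placeholder I made up for this sketch.

```python
def expand_to_longest(names):
    """Replace each name with the longest entry in `names` that contains it."""
    result = []
    for name in names:
        # `name in other` is a plain substring test, so every name matches
        # itself and max() never receives an empty sequence.
        candidates = (other for other in names if name in other)
        result.append(max(candidates, key=len))
    return result


small_example = ['David', 'David Stevens', 'Steve Martin']
print(expand_to_longest(small_example))
# ['David Stevens', 'David Stevens', 'Steve Martin']
```

Note that the test is character-based, not word-based: a bare 'Steve' in that list would be replaced by 'David Stevens', because 'Steve' is a substring of 'Stevens'. Whether that is acceptable depends on what "match" should mean for your data.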

If your list were huge, this would be too slow, and we'd need to do something cleverer, like building prefix and suffix tries. But for something this small, I think that's overkill.
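
Short of tries, one cheap improvement is to do the search once per distinct name instead of once per list entry. The sketch below assumes the same substring rule as above and uses a made-up helper name, `longest_match_map`; it is still quadratic, but only in the number of distinct names. It also gives you the deduplicated list and the mention counts you mentioned wanting.

```python
from collections import Counter


def longest_match_map(names):
    """Map each distinct name to the longest distinct name that contains it."""
    unique = list(dict.fromkeys(names))  # drop duplicates, keep first-seen order
    return {
        name: max((other for other in unique if name in other), key=len)
        for name in unique
    }


person_list = [
    'Omarosa Manigault Newman', 'Manigault Newman', 'Trump', 'Apprentice',
    'Mark Burnett', 'Manigault Newman', 'TAPES', 'Omarosa', 'Donald J. Trump',
    'Omarosa', 'Donald J. Trump', 'Jacques Derrida', 'Derrida', 'Sigmund Freud',
    'Mark Burnett', 'Manigault Newman', 'Manigault Newman', 'Trump', 'Mark Burnett',
]

mapping = longest_match_map(person_list)
corrected_list = [mapping[name] for name in person_list]

# Deduplicated version, in order of first appearance
normalized_list = list(dict.fromkeys(corrected_list))

# How often each normalized name is mentioned
mention_counts = Counter(corrected_list)

print(normalized_list)
print(mention_counts.most_common(1))
# [('Omarosa Manigault Newman', 7)]
```

The order of `normalized_list` follows first appearance in the input, so it will differ from the order shown in the question even though the set of names is the same.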

abarnert