7

Given are two Python lists containing strings (names of persons):

list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']

I want a mapping of the names that are most similar:

'J. Payne'           -> 'John Payne'
'George Bush'        -> 'George W. Bush'
'Billy Idol'         -> 'Billy Idol'
'M Stuart'           -> 'M. Stuart'
'Luc van den Bergen' -> 'Luc Bergen'

Is there a neat way to do this in Python? The lists contain on average 5 or 6 names; sometimes more, but that is rare. Sometimes there is just one name in each list, which could be spelled slightly differently.

Aufwind
  • What is your algorithmic definition of "most similar"? – cdhowie Aug 15 '11 at 06:52
  • @cdhowie: Different spellings of names, abbreviations of names, optional presence of middle words like the Belgian "van", optional middle names. I don't know how to define that algorithmically. I want to map those names whose spellings are closest. – Aufwind Aug 15 '11 at 06:55
  • In order to do this, you need to convert your idea about "closeness" of names into a function you can apply to two strings. Computers don't deal with vague specifications; they deal with math. :) – cdhowie Aug 15 '11 at 06:57
  • @cdhowie Thanks for the advice. I was hoping for a Python module that is already capable of doing this, since I don't want to reinvent the wheel; the `difflib` module mentioned below, for example. But you have a point about *math* and *computers*. :-) – Aufwind Aug 15 '11 at 07:11
  • Are the lists always the same size and is there always exactly one match in list_2 for each item in list_1? If so, the distance matching can be improved considerably. – Björn Lindqvist Aug 15 '11 at 08:49
  • @Björn: I can't guarantee that both criteria are always fulfilled. But assume they are. What does the improvement look like? I am curious. :-) So if you have the time to explain, I am looking forward to understanding. – Aufwind Aug 15 '11 at 09:44

3 Answers

11

Using the Levenshtein distance function defined here: http://hetland.org/coding/python/levenshtein.py

>>> for i in list_1:
...     print(i, '==>', min(list_2, key=lambda j: levenshtein(i, j)))
... 
J. Payne ==> John Payne
George Bush ==> George W. Bush
Billy Idol ==> Billy Idol
M Stuart ==> M. Stuart
Luc van den Bergen ==> Luc Bergen

You could use `functools.partial` instead of the lambda:

>>> from functools import partial
>>> for i in list_1:
...     print(i, '==>', min(list_2, key=partial(levenshtein, i)))
...
J. Payne ==> John Payne
George Bush ==> George W. Bush
Billy Idol ==> Billy Idol
M Stuart ==> M. Stuart
Luc van den Bergen ==> Luc Bergen
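
In case the link goes stale: here is a minimal dynamic-programming Levenshtein implementation of my own (a sketch, not the code from the linked page) that the snippets above can use:

```python
list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']

def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                current[j - 1] + 1,            # insertion
                previous[j] + 1,               # deletion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]

for name in list_1:
    print(name, '==>', min(list_2, key=lambda other: levenshtein(name, other)))
```

`min` picks, for each name, the candidate in `list_2` with the smallest edit distance; ties go to the earlier candidate in list order.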
John La Rooy
  • What is the main difference between your *levenshtein* function and the `difflib.get_close_matches()` approach of @jellybean? – Aufwind Aug 15 '11 at 08:33
  • @Aufwind, I think difflib uses quite a different algorithm. The help says it uses the SequenceMatcher. It's hard to know for sure which algorithm will be better without knowing the data it will be used on. – John La Rooy Aug 15 '11 at 10:34
10

You might try difflib:

import difflib

list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']

mymap = {}
for elem in list_1:
    closest = difflib.get_close_matches(elem, list_2)
    if closest:
        mymap[elem] = closest[0]

print(mymap)

output:

{'J. Payne': 'John Payne',
 'George Bush': 'George W. Bush',
 'Billy Idol': 'Billy Idol',
 'M Stuart': 'M. Stuart',
 'Luc van den Bergen': 'Luc Bergen'}
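
A note on the knobs (my addition, based on the standard-library documentation): `get_close_matches(word, possibilities, n=3, cutoff=0.6)` returns at most `n` candidates whose `SequenceMatcher` ratio is at least `cutoff`, best first, so the strictness of the matching can be tuned:

```python
import difflib

list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']

# n=1 keeps only the single best candidate above the default 0.6 cutoff.
print(difflib.get_close_matches('J. Payne', list_2, n=1))              # ['John Payne']

# A stricter cutoff filters out looser matches entirely.
print(difflib.get_close_matches('J. Payne', list_2, n=1, cutoff=0.9))  # []
```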
Johannes Charra
2

Here is a variant of the given solutions that also minimizes the global distance. It uses the Munkres (Hungarian) assignment algorithm to ensure that the string pairings are jointly optimal.

from munkres import Munkres  # third-party package

def match_lists(l1, l2):
    # Compute a matrix of string distances (using the levenshtein
    # function from the answer above) for all combinations of
    # items in l1 and l2.
    matrix = [[levenshtein(i1, i2) for i2 in l2] for i1 in l1]

    # Now figure out what the global minimum distance between the
    # pairs is.
    indexes = Munkres().compute(matrix)
    for row, col in indexes:
        yield l1[row], l2[col]

l1 = [
    'bolton',
    'manchester city',
    'manchester united',
    'wolves',
    'liverpool',
    'sunderland',
    'wigan',
    'norwich',
    'arsenal',
    'aston villa',
    'chelsea',
    'fulham',
    'newcastle utd',
    'stoke city',
    'everton',
    'tottenham',
    'blackburn',
    'west brom',
    'qpr',
    'swansea'
    ]
l2 = [
    'bolton wanderers',
    'manchester city',
    'manchester united',
    'wolverhampton',
    'liverpool',
    'norwich city',
    'sunderland',
    'wigan athletic',
    'arsenal',
    'aston villa',
    'chelsea',
    'fulham',
    'newcastle united',
    'stoke city',
    'everton',
    'tottenham hotspur',
    'blackburn rovers',
    'west bromwich',
    'queens park rangers',
    'swansea city'
    ]
for i1, i2 in match_lists(l1, l2):
    print(i1, '=>', i2)

For the lists given, where the differences stem more from alternative spellings and nicknames than from spelling errors, this method gives better results than using levenshtein or difflib greedily on each item. The munkres module can be found here: http://software.clapper.org/munkres/
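
To illustrate why a globally optimal assignment can beat greedy per-item matching without installing munkres, here is a brute-force sketch of my own (not from the answer) that simply tries every pairing; it uses difflib's similarity ratio instead of levenshtein and is only feasible for a handful of names:

```python
import difflib
from itertools import permutations

def match_lists_bruteforce(l1, l2):
    """Return the pairing of l1 with l2 that maximizes total similarity.

    O(len(l2)!), so only usable for small lists; the Munkres algorithm
    computes the same optimum in polynomial time.
    """
    def similarity(a, b):
        return difflib.SequenceMatcher(None, a, b).ratio()

    best = max(
        permutations(l2, len(l1)),
        key=lambda perm: sum(similarity(a, b) for a, b in zip(l1, perm)),
    )
    return list(zip(l1, best))

l1 = ['J. Payne', 'George Bush', 'M Stuart']
l2 = ['M. Stuart', 'John Payne', 'George W. Bush']
for a, b in match_lists_bruteforce(l1, l2):
    print(a, '=>', b)
```

Because every candidate permutation is scored as a whole, no single greedy choice can steal the best partner from another name.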

Björn Lindqvist