Here is a variant of the given solutions that also minimizes the
total distance over all pairs. It uses the Munkres (Hungarian)
assignment algorithm to ensure that the string pairings are optimal
as a whole.
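from munkres import Munkres
# Assumption: levenshtein comes from the python-Levenshtein package;
# any function returning the edit distance between two strings will do.
from Levenshtein import distance as levenshtein

def match_lists(l1, l2):
    # Compute a matrix of string distances for all combinations of
    # items in l1 and l2.
    matrix = [[levenshtein(i1, i2) for i2 in l2] for i1 in l1]
    # Find the pairing of rows and columns that minimizes the total
    # distance over all pairs.
    indexes = Munkres().compute(matrix)
    for row, col in indexes:
        yield l1[row], l2[col]
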
l1 = [
    'bolton',
    'manchester city',
    'manchester united',
    'wolves',
    'liverpool',
    'sunderland',
    'wigan',
    'norwich',
    'arsenal',
    'aston villa',
    'chelsea',
    'fulham',
    'newcastle utd',
    'stoke city',
    'everton',
    'tottenham',
    'blackburn',
    'west brom',
    'qpr',
    'swansea'
]
l2 = [
    'bolton wanderers',
    'manchester city',
    'manchester united',
    'wolverhampton',
    'liverpool',
    'norwich city',
    'sunderland',
    'wigan athletic',
    'arsenal',
    'aston villa',
    'chelsea',
    'fulham',
    'newcastle united',
    'stoke city',
    'everton',
    'tottenham hotspur',
    'blackburn rovers',
    'west bromwich',
    'queens park rangers',
    'swansea city'
]
for i1, i2 in match_lists(l1, l2):
    print(i1, '=>', i2)
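
If no third-party edit-distance package is at hand, the `levenshtein` function that `match_lists` relies on could be sketched in pure Python roughly as follows. This is only an illustrative dynamic-programming version, not necessarily the implementation used for the results here, and it will be slower than a C-backed library.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]


print(levenshtein('wigan', 'wigan athletic'))
```

Only two rows of the distance table are kept at a time, so memory use grows with the shorter string rather than with the product of both lengths.
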
For the lists given, where the differences stem more from alternative
spellings and nicknames than from spelling errors, this method gives better
results than using Levenshtein distance or difflib alone. The munkres module can be found here:
http://software.clapper.org/munkres/
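
To see why optimizing the total matters, here is a small sketch comparing a greedy row-by-row match with a globally optimal one. The cost matrix is made up for illustration, the function names are hypothetical, and the brute-force search over permutations merely stands in for Munkres (it is only practical for tiny inputs, whereas Munkres runs in polynomial time).

```python
from itertools import permutations


def greedy_assignment(cost):
    """Assign each row, in order, to its cheapest still-unused column."""
    unused = set(range(len(cost[0])))
    pairs = []
    for row, costs in enumerate(cost):
        col = min(unused, key=lambda c: (costs[c], c))
        unused.remove(col)
        pairs.append((row, col))
    return pairs


def optimal_assignment(cost):
    """Try every permutation of columns and keep the cheapest overall."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[r][perm[r]] for r in range(n)))
    return list(enumerate(best))


def total(cost, pairs):
    """Sum the cost of the chosen (row, column) pairs."""
    return sum(cost[r][c] for r, c in pairs)


# Made-up distances: row 0 is close to both columns, row 1 only to column 0.
cost = [[1, 2],
        [1, 5]]

greedy = greedy_assignment(cost)
optimal = optimal_assignment(cost)
print(greedy, total(cost, greedy))    # [(0, 0), (1, 1)] 6
print(optimal, total(cost, optimal))  # [(0, 1), (1, 0)] 3
```

The greedy pass gives row 0 its best column and leaves row 1 with a poor match, while the global optimum accepts a slightly worse match for row 0 to get a lower total.
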