I am trying to build a dictionary (or some similar structure) that collects my results so I can pick the best match using the Jaro distance function.
This is part of an attempt to match two lists and find, for each name in one list, the best-matching name in the other.
Example:
import jellyfish
jellyfish.jaro_distance(u'jellyfish', u'sellyfish')
output:
0.9259259259259259
What I am trying to do is:
listA = ['grellofish','mellofush','jellyfihs','sellyfish','salmonfish']
listB = ['jellyfish','salmonfish']
#convert to unicode
listA = [unicode(i) for i in listA]
listB = [unicode(i) for i in listB]
for nickB in listB:
    for nickA in listA:
        results = jellyfish.jaro_distance(nickA, nickB)
        print nickB,nickA,results
output:
jellyfish grellofish 0.825925925926
jellyfish mellofush 0.777777777778
jellyfish jellyfihs 0.962962962963
jellyfish sellyfish 0.925925925926
jellyfish salmonfish 0.685185185185
salmonfish grellofish 0.733333333333
salmonfish mellofush 0.7
salmonfish jellyfihs 0.618518518519
salmonfish sellyfish 0.755555555556
salmonfish salmonfish 1.0
In this case I want it to return only the highest-scoring pair for each name in listB:
jellyfish jellyfihs 0.962962962963
salmonfish salmonfish 1.0
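To make the goal concrete, here is a rough sketch of the behaviour I am after, not a finished solution. The names `best_match` and `jaro` are made up for illustration. `jaro` is a plain pure-Python Jaro similarity that I include only so the snippet runs without any third-party package; in practice I would pass `jellyfish.jaro_distance` as the scorer instead (I believe newer jellyfish releases call it `jaro_similarity`).

```python
def jaro(s1, s2):
    """Pure-Python Jaro similarity (illustrative stand-in for jellyfish)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    m = float(matches)
    return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0


def best_match(query, choices, scorer):
    """Return (choice, score) for the highest-scoring choice."""
    scored = ((choice, scorer(query, choice)) for choice in choices)
    return max(scored, key=lambda pair: pair[1])


listA = ['grellofish', 'mellofush', 'jellyfihs', 'sellyfish', 'salmonfish']
listB = ['jellyfish', 'salmonfish']

# One entry per name in listB: {name: (best name from listA, score)}
best = {nickB: best_match(nickB, listA, jaro) for nickB in listB}

for nickB in listB:
    nickA, score = best[nickB]
    print("%s %s %s" % (nickB, nickA, score))
```

This still scores every pair, so it only shows the shape of the result I want (one best pair per name in listB), not a faster algorithm.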
For FuzzyWuzzy users: I am trying to emulate the process.extractOne function, where you call process.extractOne(<value you want to compare>, <list of items you want to compare>) and get back the best match.
The reason I am not using FuzzyWuzzy is that it is too slow for my case, and I am unsure what it does behind the scenes: matching 5000 strings against another list of 5000 strings takes up to 40 minutes.