I am trying to build some kind of dictionary to collect my results and pick the best match using the Jaro distance function.

This is part of my attempt to match two lists and, for each name in one list, find the best-matching name in the other.

Example:

import jellyfish
jellyfish.jaro_distance(u'jellyfish', u'sellyfish')

output: 
0.9259259259259259

What I am trying to do is:

listA = ['grellofish','mellofush','jellyfihs','sellyfish','salmonfish']
listB = ['jellyfish','salmonfish']

#convert to unicode
listA = [unicode(i) for i in listA]
listB = [unicode(i) for i in listB]

for nickB in listB:
    for nickA in listA:
        results = jellyfish.jaro_distance(nickA, nickB)
        print nickB,nickA,results

output:
jellyfish grellofish 0.825925925926
jellyfish mellofush 0.777777777778
jellyfish jellyfihs 0.962962962963
jellyfish sellyfish 0.925925925926
jellyfish salmonfish 0.685185185185
salmonfish grellofish 0.733333333333
salmonfish mellofush 0.7
salmonfish jellyfihs 0.618518518519
salmonfish sellyfish 0.755555555556
salmonfish salmonfish 1.0

In this case I want it to return only the highest-scoring match for each name in listB:

jellyfish jellyfihs 0.962962962963
salmonfish salmonfish 1.0

For FuzzyWuzzy users: I am trying to emulate the process.extractOne function, where you call process.extractOne(<value you want to compare>, <list of items to compare against>) and get back the best match.

The reason I am not using FuzzyWuzzy is that it is too slow, and I am unsure what it does behind the scenes: matching 5000 strings against another list of 5000 strings takes up to 40 minutes.

BernardL

1 Answer

This might solve your problem:

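import jellyfish as jf

def get_closest_match(x, list_random):
    # Return the string in list_random that scores highest against x.
    best_match = None
    highest_jaro_wink = 0
    for current_string in list_random:
        # Jaro-Winkler score; recent jellyfish releases name this
        # function jaro_winkler_similarity.
        current_score = jf.jaro_winkler(x, current_string)
        if current_score > highest_jaro_wink:
            highest_jaro_wink = current_score
            best_match = current_string
    return best_match

for nickB in listB:
    result = get_closest_match(nickB, listA)
    print nickB, result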
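
If you also need the score, as in the expected output in the question, a variant along these lines might work. Treat it as a rough Python 3 sketch under a few assumptions: so that it runs without jellyfish installed, it includes a plain-Python Jaro similarity (jaro_similarity, written here for illustration only, not the library's implementation), and the scorer parameter is a hypothetical hook where jf.jaro_winkler or any other two-argument scoring function could be passed instead.

```python
def jaro_similarity(s1, s2):
    """Illustrative pure-Python Jaro similarity in [0.0, 1.0]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters only count as a match if their positions
    # are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are
    # counted as half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def get_closest_match(query, choices, scorer=jaro_similarity):
    """Return (best_choice, best_score), or (None, 0.0) for no choices."""
    best_match, best_score = None, 0.0
    for candidate in choices:
        score = scorer(query, candidate)
        if score > best_score:
            best_match, best_score = candidate, score
    return best_match, best_score


listA = ['grellofish', 'mellofush', 'jellyfihs', 'sellyfish', 'salmonfish']
listB = ['jellyfish', 'salmonfish']

# One line per name in listB: the name, its best match in listA, the score.
for nickB in listB:
    match, score = get_closest_match(nickB, listA)
    print(nickB, match, score)
```

With the lists from the question this should pick jellyfihs for jellyfish and salmonfish for salmonfish. Note that it still compares every name in listB with every name in listA, so the number of scorer calls is len(listA) * len(listB).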
Nick is tired