I have a function that calculates fuzzywuzzy scores for two texts:

def fuzzywuzzy(text_1, text_2):
    scores = {
        'ratio': fuzz.ratio(tn.normalize_title(text_1), tn.normalize_title(text_2)) / 100,
        'partial_ratio': fuzz.partial_ratio(tn.normalize_title(text_1), tn.normalize_title(text_2)) / 100,
        'token_sort_ratio': fuzz.token_sort_ratio(tn.normalize_title(text_1), tn.normalize_title(text_2)) / 100,
        'token_set_ratio': fuzz.token_set_ratio(tn.normalize_title(text_1), tn.normalize_title(text_2)) / 100}
    return scores

As the code above shows, I normalize both texts before calculating the scores. The fuzzywuzzy function is called here:

event['scores'] = fuzzywuzzy(v_data['text1'], event['_source']['event_record']['text2'])

I need to change this so that the scores are returned only when the token_set_ratio score is greater than 0.99. I am applying this code to 2000+ records.

Any ideas would be appreciated.

A J
SaNa
  • So do you want event['scores'] populated only with scores where score['token_set_ratio'] is > 0.99? – petre Dec 12 '18 at 06:04
  • Do you want to avoid computing all the other scores in that case? What exactly do you need? – petre Dec 12 '18 at 06:13
  • Yes, that's right. I want to populate event['scores'] with the information of the matching records (including their four fuzzy scores) that have score['token_set_ratio'] > 0.99. In other words, when populating event['scores'], I want to skip records whose token_set_ratio is not above 0.99. I hope that is clear. – SaNa Dec 12 '18 at 08:01

1 Answer

If I understand correctly what you want to do, here's my suggestion:

def fuzzywuzzy(text_1, text_2, cutoff=0.99):
    token_set_ratio = fuzz.token_set_ratio(tn.normalize_title(text_1),tn.normalize_title(text_2)) / 100
    if token_set_ratio > cutoff:
        return {
            'ratio' : fuzz.ratio(tn.normalize_title(text_1),tn.normalize_title(text_2)) / 100,
            'partial_ratio' : fuzz.partial_ratio(tn.normalize_title(text_1),tn.normalize_title(text_2)) / 100,
            'token_sort_ratio' : fuzz.token_sort_ratio(tn.normalize_title(text_1),tn.normalize_title(text_2)) / 100,
            'token_set_ratio' : token_set_ratio}
    return None

Then you can do something like (assuming there's a list of events):

    for event in events:
        s = fuzzywuzzy(...)
        if s:
            event['scores'] = s

And here's a more Pythonic form of it:

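from fuzzywuzzy import fuzz  # fuzz is a module inside the fuzzywuzzy package
import tn  # assumed: the asker's own module that provides normalize_title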


def fuzzywuzzy(text_1, text_2, cutoff=0.99):
    def _compute_ratio(fn):
        return fn(tn.normalize_title(text_1), tn.normalize_title(text_2)) / 100

    token_set_ratio = _compute_ratio(fuzz.token_set_ratio)
    if token_set_ratio > cutoff:
        return {
            'ratio': _compute_ratio(fuzz.ratio),
            'partial_ratio': _compute_ratio(fuzz.partial_ratio),
            'token_sort_ratio': _compute_ratio(fuzz.token_sort_ratio),
            'token_set_ratio': token_set_ratio,
        }
    return None
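
A further refinement, offered only as a rough sketch: the versions above normalize both texts again for every scorer, so with 2000+ records it may be worth normalizing once per pair and computing the remaining scores only when the gating score passes the cutoff. To keep the sketch runnable without fuzzywuzzy installed, the names `normalize`, `ratio`, `token_set_ratio`, `SCORERS`, and `score_if_match` below are hypothetical stand-ins built on the standard library's `difflib`; they are not the real `tn.normalize_title` or `fuzz` functions and will give different numbers.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Hypothetical stand-in for tn.normalize_title: lowercase, collapse whitespace.
    return " ".join(text.lower().split())


def ratio(a, b):
    # Stand-in scorer on a 0..100 scale, like the fuzz functions.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio(a, b):
    # Crude stand-in: compare the sorted, de-duplicated tokens of each text.
    def canon(s):
        return " ".join(sorted(set(s.split())))
    return ratio(canon(a), canon(b))


SCORERS = {"ratio": ratio, "token_set_ratio": token_set_ratio}


def score_if_match(text_1, text_2, scorers, gate="token_set_ratio",
                   cutoff=0.99, normalize=str):
    # Normalize each text once instead of once per scorer.
    a, b = normalize(text_1), normalize(text_2)
    gate_score = scorers[gate](a, b) / 100
    if gate_score <= cutoff:
        return None  # below the cutoff: skip the other scorers entirely
    scores = {name: fn(a, b) / 100 for name, fn in scorers.items() if name != gate}
    scores[gate] = gate_score
    return scores


# Made-up sample data, for illustration only.
query = "The Great Gatsby"
events = [{"text2": "gatsby the great"}, {"text2": "Moby Dick"}]

matched = []
for event in events:
    s = score_if_match(query, event["text2"], SCORERS, normalize=normalize)
    if s is not None:
        event["scores"] = s
        matched.append(event)
```

With the real library, `SCORERS` would presumably map the four names to `fuzz.ratio`, `fuzz.partial_ratio`, `fuzz.token_sort_ratio`, and `fuzz.token_set_ratio`, and `normalize` would be `tn.normalize_title`. Events that do not pass the cutoff never receive a 'scores' key, and `matched` holds only the records that do.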
petre