
I have created an object called Issuer, which contains a member named issuer_name.

I want to take advantage of fuzzywuzzy's process.extract() function, but it only takes in a list of strings. My goal is to find matches and return the list of objects that match by the issuer_name.

I came up with the method below, but it runs really slowly. The issuers list contains over 100,000 elements.

from fuzzywuzzy import fuzz

# word: str, issuers: list of Issuer, threshold: int (0-100)
def fuzzyMatchWordToIssuers(word, issuers, threshold):
    limit = 5
    res = []
    for issuer in issuers:
        calc = fuzz.token_set_ratio(word, issuer.issuer_name)
        if calc >= threshold:
            res.append(issuer)
            # returns the first `limit` matches found, not the best ones
            if len(res) == limit:
                return res
    return res

Is it possible to use process.extract() somehow, or to speed this up?

For reference, here's the GitHub example:

process.extract("new york jets", choices, limit=2)
martineau
firstblud

1 Answer


Preface

My solution was tested for correctness. I needed to search a list of objects, and this is the solution that worked for me. However, I did not test its performance, and performance did not matter to me because my data sets are rather small. For large data sets I strongly recommend a third-party tool; a cloud-based search service will probably scale better and give reasonable performance.

Solution

fuzzywuzzy's process.extract can apparently handle a dictionary, in which case only the values are searched and the result is a list of tuples with the following structure

(match, score, key)

where match and score are the same as when using extract with a list, and key is the key whose string value matched (values must still be strings). So you will need to build a dictionary of issuer names keyed by list index, like so

issuer_names_dict = dict(enumerate(issuer.issuer_name for issuer in issuers))

Then you can pass this dictionary to process.extract (I think you should use extractBests since you are using a cutoff threshold)

best_issuers = process.extractBests(word, issuer_names_dict, score_cutoff=threshold, limit=5)

Finally you will need to assemble the result list

res = [issuers[key] for (_match, _score, key) in best_issuers]
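
Putting the three steps together, a minimal sketch could look like the following. The Issuer dataclass here is a stand-in for the asker's class, and match_issuers with its extract_bests parameter is a hypothetical name and hook of my own (not part of fuzzywuzzy), added only so the function can be exercised with a substitute extractor; by default it falls back to process.extractBests.

```python
from dataclasses import dataclass


@dataclass
class Issuer:
    # Stand-in for the asker's Issuer class (assumption: only the name matters here).
    issuer_name: str


def match_issuers(word, issuers, threshold, limit=5, extract_bests=None):
    """Return up to `limit` Issuer objects whose issuer_name scores >= threshold."""
    if extract_bests is None:
        # Imported lazily so the module loads even where fuzzywuzzy is absent.
        from fuzzywuzzy import process
        extract_bests = process.extractBests

    # Key each name by its position in `issuers` so matches map back to objects.
    names_by_index = {i: issuer.issuer_name for i, issuer in enumerate(issuers)}

    # With a dict of choices, each result is a (match, score, key) tuple.
    best = extract_bests(word, names_by_index, score_cutoff=threshold, limit=limit)

    return [issuers[key] for _match, _score, key in best]
```

Calling match_issuers(word, issuers, threshold) then returns Issuer objects rather than strings. Note that extractBests uses fuzz.WRatio as its default scorer; to keep the scoring from the question, scorer=fuzz.token_set_ratio would have to be passed through as well.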
Uri Brecher