Fuzzy match and get index of a pattern from a string

Question

I have a list of company names that I want to match against a list of sentences and get the index start and end position if a keyword is present in any of the sentences.

I wrote the code for matching the keywords exactly but realized that names in the sentences won't always be an exact match. For example, my keywords list can contain Company One Two Ltd but the sentences can be -

Company OneTwo Ltd won the auction
Company One Two Limited won the auction
The auction was won by Co. One Two Ltd and other variations

Given a company name, I want to find out the index start and end position even if the company name in the sentence is not an exact match but a variation. Below is the code I wrote for exact matching.

import re

def find_index(texts, target):
    # Collect (sentence_index, start, end) for every exact match of target.
    idxs = []
    for i, each_sent in enumerate(texts):
        # Run the search once per sentence and reuse the result.
        add = [(i, m.start(0), m.end(0)) for m in re.finditer(target, each_sent)]
        if add:
            idxs.append(add)
    return idxs

you may have to modify the target to be more versatile like `(Company|Co\.?)\s?One\s?Two\s?(Limited|Ltd)` — depperm, Aug 10 '21 at 16:13
@depperm, I have about 10k such company names with not much common in each other for which I am trying to get the indices. This might work for a few cases but it's not going to be possible to go manually through all possibilities. — Clock Slave, Aug 10 '21 at 16:17
while not clean you could iterate through company list and create fuzzy searches `Company`->`(Company|Co\.?)`, `' '`->`\s?`, `Limited`->`(Limited|Ltd)`, etc. It's hard to come up with possible solutions without knowing all the data — depperm, Aug 10 '21 at 16:21
another option is to create a levenshtein distance calculator, though it has its own drawbacks [example](https://www.datacamp.com/community/tutorials/fuzzy-string-python) — depperm, Aug 10 '21 at 16:39
@depperm, Levenshtein distance is what I found as well. Looking into it.. Thanks for the link — Clock Slave, Aug 10 '21 at 17:00
@depperm - how about you turn your comments into an answer so that clock-slave can accept? — sophros, Aug 16 '21 at 05:53

score 0 · Answer 1 · answered Aug 21 '21 at 15:12

I can think of 2-3 possibilities all with varying pros/cons:

Create More Versatile regex

(Company|Co\.?)\s?One\s?Two\s?(Limited|Ltd)
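As a minimal sketch of how that pattern behaves on the three sentences from the question (the names `PATTERN`, `sentences` and `find_spans` are mine, and unlike the question's function this returns one flat list of tuples):

```python
import re

# Pattern copied from the suggestion above.
PATTERN = re.compile(r"(Company|Co\.?)\s?One\s?Two\s?(Limited|Ltd)")

sentences = [
    "Company OneTwo Ltd won the auction",
    "Company One Two Limited won the auction",
    "The auction was won by Co. One Two Ltd and other variations",
]

def find_spans(texts, pattern):
    """Return (sentence_index, start, end) for every match of pattern."""
    return [
        (i, m.start(), m.end())
        for i, text in enumerate(texts)
        for m in pattern.finditer(text)
    ]

spans = find_spans(sentences, PATTERN)
print(spans)
```

For these sentences it prints [(0, 0, 18), (1, 0, 23), (2, 23, 38)], so all three variations are located with their start and end positions.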

Building on the previous suggestion, iterate through the company list and generate a fuzzy pattern for each name by substituting known variations:

Company->(Company|Co\.?), ' '->\s?, Limited->(Limited|Ltd), etc.
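A rough sketch of that substitution step, assuming a small hand-written table of abbreviations. The `VARIANTS` table and `build_fuzzy_pattern` are hypothetical names, and the table would have to be extended to fit the real data:

```python
import re

# Hypothetical abbreviation table (an assumption, not a complete list).
VARIANTS = {
    "company": r"(?:Company|Co\.?)",
    "co": r"(?:Company|Co\.?)",
    "limited": r"(?:Limited|Ltd\.?)",
    "ltd": r"(?:Limited|Ltd\.?)",
}

def build_fuzzy_pattern(name):
    """Turn one company name into a regex that tolerates the variations above."""
    parts = []
    for word in name.split():
        key = word.lower().rstrip(".")
        # Known words get their alternation; everything else is matched literally.
        parts.append(VARIANTS.get(key, re.escape(word)))
    # Whitespace between words becomes optional; matching ignores case.
    return re.compile(r"\s?".join(parts), re.IGNORECASE)

pattern = build_fuzzy_pattern("Company One Two Ltd")
match = pattern.search("The auction was won by Co. One Two Ltd and other variations")
print(match.span())
```

The same function can be mapped over all 10k names. Because the whitespace is optional and case is ignored, it can also produce false positives, so the results are worth spot-checking.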

Levenshtein distance calculator

This [example](https://www.datacamp.com/community/tutorials/fuzzy-string-python) references the external library fuzzywuzzy; there are alternatives such as fuzzy.
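Since the question asks for positions and not just a similarity score, here is a plain-Python sketch of approximate substring search based on Levenshtein distance. It does not use fuzzywuzzy. `fuzzy_find` and `max_dist` are names I made up, and the threshold of 6 is an arbitrary assumption:

```python
def fuzzy_find(pattern, text, max_dist):
    """Return (start, end, distance) of the best approximate match, or None.

    Standard edit-distance table, except that a match may begin at any
    text position for free; each cell also remembers where its match began.
    """
    if not pattern:
        return None
    m = len(pattern)
    # prev[i] = (distance, start) for pattern[:i] before any text is read.
    prev = [(i, 0) for i in range(m + 1)]
    best = None
    for j, ch in enumerate(text, start=1):
        cur = [(0, j)]  # empty prefix: a match may start right here at no cost
        for i in range(1, m + 1):
            # pattern[i-1] aligned with ch (free if they are equal)
            substitute = (prev[i - 1][0] + (pattern[i - 1] != ch), prev[i - 1][1])
            # ch is an extra character in the text
            extra_in_text = (prev[i][0] + 1, prev[i][1])
            # pattern[i-1] has no counterpart in the text
            missing_in_text = (cur[i - 1][0] + 1, cur[i - 1][1])
            cur.append(min(substitute, extra_in_text, missing_in_text))
        dist, start = cur[m]
        if dist <= max_dist and (best is None or dist < best[2]):
            best = (start, j, dist)
        prev = cur
    return best

name = "Company One Two Ltd"
print(fuzzy_find(name, "Company OneTwo Ltd won the auction", 6))
print(fuzzy_find(name, "The auction was won by Co. One Two Ltd and other variations", 6))
```

This prints (0, 18, 1) and (23, 38, 5). The drawbacks show up quickly: it costs time proportional to len(pattern) * len(text) for every name and sentence pair, a sensible `max_dist` depends on the length of the name, and for a long form such as "Limited" the best-scoring span may stop partway through the word.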
