I have a list of company names that I want to match against a list of sentences, returning the start and end index of each match whenever a name appears in a sentence.

I wrote code that matches the names exactly, but then realized that the names in the sentences won't always be an exact match. For example, my keyword list may contain Company One Two Ltd, while the sentences may read:

  • Company OneTwo Ltd won the auction
  • Company One Two Limited won the auction
  • The auction was won by Co. One Two Ltd and other variations

Given a company name, I want to find the start and end index of the match even when the name in the sentence is a variation rather than an exact match. Below is the code I wrote for exact matching.

import re

def find_index(texts, target):
    # target is treated as a regex pattern; use re.escape(target) for literal matching
    pattern = re.compile(target)
    idxs = []
    for i, each_sent in enumerate(texts):
        # collect (sentence index, start, end) for every match in this sentence
        add = [(i, m.start(0), m.end(0)) for m in pattern.finditer(each_sent)]
        if add:
            idxs.append(add)
    return idxs
Clock Slave
  • you may have to modify the target to be more versatile like `(Company|Co\.?)\s?One\s?Two\s?(Limited|Ltd)` – depperm Aug 10 '21 at 16:13
  • @depperm, I have about 10k such company names with not much common in each other for which I am trying to get the indices. This might work for a few cases but it's not going to be possible to go manually through all possibilities. – Clock Slave Aug 10 '21 at 16:17
  • while not clean you could iterate through company list and create fuzzy searches `Company`->`(Company|Co\.?)`, `' '`->`\s?`, `Limited`->`(Limited|Ltd)`, etc. It's hard to come up with possible solutions without knowing all the data – depperm Aug 10 '21 at 16:21
  • another option is to create a levenshtein distance calculator, though it has its own drawbacks [example](https://www.datacamp.com/community/tutorials/fuzzy-string-python) – depperm Aug 10 '21 at 16:39
  • @depperm, Levenshtein distance is what I found as well. Looking into it.. Thanks for the link – Clock Slave Aug 10 '21 at 17:00
  • @depperm - how about you turn your comments into an answer so that clock-slave can accept? – sophros Aug 16 '21 at 05:53

1 Answer

I can think of three possibilities, each with its own pros and cons:

  1. Create a more versatile regex

(Company|Co\.?)\s?One\s?Two\s?(Limited|Ltd)

  2. Building on the previous suggestion, iterate through the company list and build a fuzzy search pattern for each name

Company->(Company|Co\.?), ' '->\s?, Limited->(Limited|Ltd), etc.
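The substitutions above can be applied automatically to every name in the list. The following is only a sketch of that idea: the `VARIANTS` table and the helper names `build_fuzzy_pattern` and `find_fuzzy` are hypothetical, and the table would need to be extended to cover the abbreviations that actually occur in your data.

```python
import re

# Hypothetical abbreviation table (an assumption, not exhaustive):
# maps a lower-cased token to a regex fragment covering its known spellings.
VARIANTS = {
    "company": r"(?:Company|Co\.?)",
    "co": r"(?:Company|Co\.?)",
    "limited": r"(?:Limited|Ltd\.?)",
    "ltd": r"(?:Limited|Ltd\.?)",
}

def build_fuzzy_pattern(name):
    """Turn a company name into a regex that tolerates known variations."""
    parts = []
    for token in name.split():
        key = token.lower().rstrip(".")
        # Known tokens get their alternation; everything else is matched literally.
        parts.append(VARIANTS.get(key, re.escape(token)))
    # Spaces between tokens become optional, so "One Two" also matches "OneTwo".
    return re.compile(r"\b" + r"\s?".join(parts), re.IGNORECASE)

def find_fuzzy(texts, name):
    """Return (sentence index, start, end) for every fuzzy match of name."""
    pattern = build_fuzzy_pattern(name)
    return [
        (i, m.start(), m.end())
        for i, sent in enumerate(texts)
        for m in pattern.finditer(sent)
    ]

sentences = [
    "Company OneTwo Ltd won the auction",
    "Company One Two Limited won the auction",
    "The auction was won by Co. One Two Ltd and other variations",
]
for i, start, end in find_fuzzy(sentences, "Company One Two Ltd"):
    print(i, start, end, sentences[i][start:end])
```

With around 10k names it is probably worth building each pattern once and reusing it across all sentences rather than recompiling per call.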

  3. Levenshtein distance calculator

    [example](https://www.datacamp.com/community/tutorials/fuzzy-string-python)

The example relies on the external library fuzzywuzzy; there are alternatives, such as fuzzy.
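To get start and end positions out of a similarity score, one approach is to slide a window of tokens over each sentence and keep the best-scoring window. The sketch below makes several assumptions: it uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in for a true Levenshtein distance (so that nothing needs to be installed), the helper name `best_fuzzy_span` is hypothetical, and the default threshold of 0.8 is an arbitrary starting point that would need tuning on real data.

```python
import re
from difflib import SequenceMatcher

def best_fuzzy_span(sentence, name, threshold=0.8):
    """Return (start, end, score) of the token window most similar to name,
    or None if no window reaches the threshold."""
    # Character span of every whitespace-separated token in the sentence.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]
    n = len(name.split())
    target = name.lower()
    best = None
    # Also try windows one token shorter or longer than the name,
    # so merged ("OneTwo") or split words can still line up.
    for size in range(max(1, n - 1), n + 2):
        for i in range(len(tokens) - size + 1):
            start, end = tokens[i][0], tokens[i + size - 1][1]
            candidate = sentence[start:end].lower()
            score = SequenceMatcher(None, target, candidate).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (start, end, score)
    return best

sentence = "Company OneTwo Ltd won the auction"
span = best_fuzzy_span(sentence, "Company One Two Ltd")
if span is not None:
    start, end, score = span
    print(start, end, sentence[start:end])
```

This compares every window against every name, so for 10k names it will be much slower than a compiled regex. It also inherits the usual drawback of similarity scores: two different companies with similar names can score above the threshold.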

depperm