
I have a list of several thousand locations and a list of millions of sentences. My objective is to return a list of tuples reporting each comment that was matched and the location mentioned within it. For example:

locations = ['Turin', 'Milan']
state_init = ['NY', 'OK', 'CA']

sent = ['This is a sent about turin. ok?', 'This is a sent about Melan.', 'Alan Turing was not from the state of OK.']

result = [('Turin', 'This is a sent about turin. ok?'), ('Milan', 'This is a sent about Melan.'), ('OK', 'Alan Turing was not from the state of OK.')]

In other words: I do not want to match locations embedded within other words, and I do not want to match state initials that are not capitalized. If possible, I would also like to catch misspellings, i.e. fuzzy matches of a location that omit one correct letter, replace one correct letter with an incorrect one, or swap two adjacent correct letters. For example:

Milan

should match

Melan, Mlian, or Mlan, but not Milano.
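For clarity, here is a sketch of the near-match rule I have in mind (a hypothetical helper, not part of my current function): exactly one omitted letter, one substituted letter, or one swap of adjacent letters, but no inserted letters, which is why Milano should not match.

    def is_near_match(candidate, target):
        # exact match, or exactly one omitted letter, one substituted
        # letter, or one swap of adjacent letters; insertions are
        # deliberately excluded, so 'Milano' does not match 'Milan'
        c, t = candidate.lower(), target.lower()
        if c == t:
            return True
        if len(c) == len(t) - 1:
            # one letter omitted from the target
            return any(t[:i] + t[i + 1:] == c for i in range(len(t)))
        if len(c) == len(t):
            diffs = [i for i in range(len(t)) if c[i] != t[i]]
            if len(diffs) == 1:      # one substituted letter
                return True
            if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                    and c[diffs[0]] == t[diffs[1]]
                    and c[diffs[1]] == t[diffs[0]]):
                return True          # one adjacent swap
        return False

    for w in ['Melan', 'Mlian', 'Mlan', 'Milano']:
        print(w, is_near_match(w, 'Milan'))
    # Melan True, Mlian True, Mlan True, Milano False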

The function below works very well at everything except the fuzzy matching and returning tuples, but I do not know how to do either of those things without a for loop. Not that I am against using a for loop, but I still would not know how to implement one in a computationally efficient way.

Is there a way to add the functionality I am interested in, or am I trying to do too much in a single function?

import re

def find_keyword_comments(sents, locations, state_init):
    # build alternation patterns from the keyword lists
    keywords = '|'.join(locations)
    keywords1 = '|'.join(state_init)
    # case-insensitive for location names, case-sensitive for state initials
    word = re.compile(r"^.*\b({})\b.*$".format(keywords), re.I)
    word1 = re.compile(r"^.*\b({})\b.*$".format(keywords1))
    newlist = filter(word.match, sents)
    newlist1 = filter(word1.match, sents)
    final = list(newlist) + list(newlist1)
    return final
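For reference, one way I can imagine producing the (keyword, sentence) tuples is a plain loop with search and a capture group (a sketch only; find_keyword_tuples is a hypothetical name, and note that group(1) returns the spelling as it appears in the sentence, not the canonical one):

    import re

    def find_keyword_tuples(sents, locations, state_init):
        # same alternation patterns as above, but keep the matched text
        word = re.compile(r"\b({})\b".format('|'.join(locations)), re.I)
        word1 = re.compile(r"\b({})\b".format('|'.join(state_init)))
        results = []
        for s in sents:
            m = word.search(s) or word1.search(s)
            if m:
                results.append((m.group(1), s))
        return results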
JRR
  • Don't assume that the lack of a for loop makes it fast. Regexes can be pretty expensive. It would probably help if you defined what you consider to be a fuzzy match. A good start would probably be 'at most one letter changed or one pair of consecutive letters swapped'. Even that is pretty generous, e.g. 'parks' will match 'Paris'. So maybe it has to be an exact match when the location name is short. – Alex Hall May 11 '18 at 15:57
  • Thank you for the feedback! I have amended the description to reflect what I am looking for in a near-match. Hopefully this is more helpful. I also totally understand that I am likely misunderstanding the computation time associated with some of these procedures and would be open to anything that works. Thanks again! – JRR May 11 '18 at 16:09
  • Have you tried running the code with your full data to see how long it takes? Is it acceptably fast? Can you share your data with us? Also try making your regex just `r"\b({})\b"` and change the filter functions to `word.search`. – Alex Hall May 11 '18 at 16:14
  • Also are your location names all composed of only letters, or do some contain punctuation or spaces? – Alex Hall May 11 '18 at 16:20
  • Not able to share the data, but it takes approximately 30 min. to run the above function. When I added a for loop to build tuples, the run time jumped to >3 hrs. Over 90 percent of the locations include only letters, but some have spaces, a dash or a period. I will try running the regex you mention above now. – JRR May 11 '18 at 16:29

1 Answer


I would recommend you look at metrics for fuzzy matching; the main one you are interested in is the Levenshtein distance (sometimes called the edit distance).
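As a reference point, here is a minimal pure-Python sketch of the metric (the classic dynamic-programming formulation, not tuned for speed):

    def levenshtein(a, b):
        # minimum number of insertions, deletions and substitutions
        # needed to turn a into b
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein('Milan', 'Melan'))  # 1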

There are implementations in pure Python (one minimal sketch is above), but you can leverage a few modules to make your life easier:

  • fuzzywuzzy is a very common (pip-installable) package which implements this distance as what they call the ratio. It provides a bit more functionality than you may be looking for (partial string matching, ignoring punctuation marks, token-order insensitivity...). The only drawback is that the ratio takes the length of the strings into account as well. See this response for further basic usage:

    from fuzzywuzzy import fuzz
    fuzz.ratio("this is a test", "this is a test!")  # 97
    
  • python-Levenshtein is a pretty fast package because it is basically a Python wrapper around the underlying C library. The documentation is not the nicest, but it works. It is now back in the PyPI index, so it is pip-installable.
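As a quick sanity check against the examples in the question (assuming the package is installed; the module imports as Levenshtein):

    import Levenshtein

    # note: plain Levenshtein counts an adjacent swap as 2 edits and an
    # inserted letter as 1, so it does not line up exactly with the rules
    # in the question; Damerau-Levenshtein is the variant that treats a
    # transposition as a single edit
    for w in ['Melan', 'Mlan', 'Mlian', 'Milano']:
        print(w, Levenshtein.distance('Milan', w))
    # Melan 1, Mlan 1, Mlian 2, Milano 1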

WillMonge