I have a list of several thousand locations and a list of millions of sentences. My objective is to return a list of tuples reporting each sentence that was matched and the location mentioned within it. For example:
locations = ['Turin', 'Milan']
state_init = ['NY', 'OK', 'CA']
sent = ['This is a sent about turin. ok?', 'This is a sent about Melan.', 'Alan Turing was not from the state of OK.']
result = [('Turin', 'This is a sent about turin. ok?'), ('Milan', 'This is a sent about Melan.'), ('OK', 'Alan Turing was not from the state of OK.')]
In words: I do not want to match locations embedded within other words, and I do not want to match state initials unless they are capitalized. If possible, I would also like to catch misspellings (fuzzy matches) of locations that omit one correct letter, replace one correct letter with an incorrect one, or swap the order of two adjacent correct letters. For example:
Milan
should match
Melan, Mlian, or Mlan but not Milano
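(For anyone wondering how that edit rule differs from plain edit distance: insertions are excluded, which is why 'Milano' must not match even though it is only one edit from 'Milan'. A minimal sketch of generating exactly the allowed one-edit forms of a word, so they could be OR-ed into a pattern, might look like this -- `one_edit_variants` is a hypothetical helper name, not from the question:)

```python
def one_edit_variants(word):
    """All one-edit forms of `word` under the rule above:
    one letter omitted, one letter replaced, or two adjacent
    letters swapped. Insertions are deliberately excluded,
    so 'Milano' is never produced from 'Milan'."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    variants = set()
    for i in range(len(word)):
        # omit one letter: 'Milan' -> 'Mlan'
        variants.add(word[:i] + word[i + 1:])
        # replace one letter: 'Milan' -> 'Melan'
        # (replacing a letter with itself also keeps the
        # correct spelling in the set, which is harmless here)
        for c in letters:
            variants.add(word[:i] + c + word[i + 1:])
        # swap two adjacent letters: 'Milan' -> 'Mlian'
        if i < len(word) - 1:
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return variants
```

Pre-generating these variants once per location keeps the per-sentence work inside a single compiled regex, at the cost of a larger alternation.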
The function below works very well at everything except the fuzzy matching and returning tuples, but I do not know how to do either of those things without a for loop. Not that I am against using a for loop, but I still would not know how to implement this in a computationally efficient way.
Is there a way to add the functionality I am interested in, or am I trying to do too much in a single function?
import re

def find_keyword_comments(sents, locations, state_init):
    keywords = '|'.join(locations)
    keywords1 = '|'.join(state_init)
    # locations are matched case-insensitively; state initials are not
    word = re.compile(r"^.*\b({})\b.*$".format(keywords), re.I)
    word1 = re.compile(r"^.*\b({})\b.*$".format(keywords1))
    newlist = filter(word.match, sents)
    newlist1 = filter(word1.match, sents)
    final = list(newlist) + list(newlist1)
    return final
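(For the tuple part of the question, one way with only the stdlib `re` module is to search each sentence with a capturing group, so the group tells you which alternative matched. This is a sketch, not the asker's code; it still loops over sentences, but each sentence is scanned by one compiled alternation:)

```python
import re

def find_keyword_comments_tuples(sents, locations, state_init):
    """Return (location, sentence) tuples: locations are matched
    case-insensitively as whole words, state initials only when
    capitalized."""
    loc_re = re.compile(r"\b({})\b".format("|".join(map(re.escape, locations))), re.I)
    init_re = re.compile(r"\b({})\b".format("|".join(map(re.escape, state_init))))
    # map lowercased spellings back to the canonical spelling,
    # so 'turin' in a sentence is reported as 'Turin'
    canon = {loc.lower(): loc for loc in locations}
    results = []
    for sent in sents:
        m = loc_re.search(sent)
        if m:
            results.append((canon[m.group(1).lower()], sent))
        m = init_re.search(sent)
        if m:
            results.append((m.group(1), sent))
    return results
```

A fuzzy variant could be folded in by widening the location alternation with the pre-generated misspellings of each location, keeping the same canonical-spelling lookup.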