
Let me start with an example. Consider the following list in Python:

cities = [
    'New york',
    'San francisco',
    'California',
    'Las vegas',
    'Chicago',
    'Miami'
]

I also have the following sentences:

sentences = [
    "Both of us were new to New York City, and had few or no friends.",
    "Win three more games and he becomes king of San Francisco.",
    "Uncurling from the couch, she started to the bedroom of her father's small Miami apartment."
]

For each sentence, I want to find the substring of that sentence that is closest to any string present in the list. In other words, for each sentence in sentences I want the longest sub-phrase that is closest to some string in the cities list. In this case, my result should look like:

desired_result = [
    'New york',
    'San francisco',
    'Miami'
]

There are a few algorithms I have in mind, but they are not ideal.

Algorithm 1

One algorithm gives very good results, but it is very bad in terms of time complexity. I extract all sub-phrases of a sentence, starting from the n-word sub-phrase down to one-word sub-phrases for a sentence with n tokens. Then I use the difflib.get_close_matches function to detect any sub-phrase that is close to a string in the list of cities. However, the complexity is clearly very high: a sentence of length n has O(n^2) sub-phrases. Also, the list of cities is not small; in my real-world use case it contains around 7 million strings.

In this case, my code looks like this:


from difflib import get_close_matches
from typing import List


def generate_subphrases(sen: str) -> List[str]:
    # Generate all contiguous sub-phrases, from the full n-word phrase
    # down to single words, so that longer matches are tried first.
    tokens = sen.split()
    n = len(tokens)
    subphrases: List[str] = []
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            subphrases.append(' '.join(tokens[start:start + length]))
    return subphrases


result = []
for sen in sentences:
    subphrases = generate_subphrases(sen)
    ans = None
    for phrase in subphrases:
        # get_close_matches returns a (possibly empty) list of close matches
        if get_close_matches(phrase, cities):
            ans = phrase
            break
    result.append(ans)

print(result)

Algorithm 2

This is a bit faster than the previous approach, but not as good. The benefit of the first approach is that it tolerates a few mismatches: for example, New York would be detected even if the cities list only contained New york. Here, however, not even a single character mismatch is tolerated, while in my use case I can tolerate errors of up to 30-35% in terms of character mismatches. In this approach, I form one huge regular expression as the union of all cities in the list and use re.search to find a matching sub-phrase in my sentence. In my opinion, this is faster but not very good.
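
A minimal sketch of this regex-union idea (illustrative only; the longest-first sorting is just to keep shorter names from shadowing longer overlapping alternatives):

import re

# Build one alternation of all (escaped) city names, longest first.
city_pattern = re.compile(
    '|'.join(re.escape(c) for c in sorted(cities, key=len, reverse=True))
)

for sen in sentences:
    m = city_pattern.search(sen)
    print(m.group(0) if m else None)

With the sample data above this only finds Miami (the first two sentences produce no match), which is exactly the case-sensitivity problem I described.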

I want to know whether there is any data structure I can use to achieve this, or any Python utility function similar to difflib.get_close_matches that allows searching over an entire sentence.

Update

My ultimate goal is to make algorithm 1 more efficient, maybe using some string algorithm I am not familiar with, or some data structure. I also considered a `Trie` data structure, but again, a trie helps with exact matches and not with the soft matching that the Python utility function described in algorithm 1 provides.

Note: I'm not doing a NER task here. The example was only provided to illustrate the problem. For this reason, I cannot use pre-trained machine learning models like Spacy or NLTK to identify cities. The overall goal is not to identify cities, but to identify the sub-phrase in a string that is closest to any string from a list of strings.

Ruchit
  • Why are you telling us **two** algorithms, if you want to know something related to the code you posted, which, according to what you wrote, belongs to the first algorithm? – user1934428 Oct 05 '21 at 15:30
  • Maybe have a look at the NLTK package https://realpython.com/nltk-nlp-python/ -- https://www.nltk.org/ – hpelleti Oct 05 '21 at 15:32
  • @user1934428, I tried to show how I tried to approach this problem. Now I also think that in my use case, algorithm 2 might be an incorrect solution. Apologies for posting the second algorithm. I want to achieve what I do in algorithm 1 more efficiently, probably using some data structure or efficient algorithm. – Ruchit Oct 05 '21 at 15:34
  • @Ruchit : Point out in your question which part of your code you want to make more efficient, and what alternatives you have considered. – user1934428 Oct 05 '21 at 15:36
  • @user1934428, added an update. I want to make algorithm 1 more efficient – Ruchit Oct 05 '21 at 15:40
  • try with ahocorasick algorithm – GolamMazid Sajib Oct 05 '21 at 16:40

1 Answer


If you can create the possible matching strings before running the algorithm, pyahocorasick would be a perfect solution for your use-case since it precomputes a trie with all the cities you're trying to match.

The downside is that you'd need to provide the variations / possible character-mismatch patterns yourself, since Aho-Corasick only finds exact occurrences.
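
A minimal sketch of that idea (lowercasing both sides is just one way to absorb case differences; any other variations would still have to be added as extra keys):

import ahocorasick

# Build the automaton once over the city list (plus any variations you
# generate), then scan each sentence for exact hits.
automaton = ahocorasick.Automaton()
for city in cities:
    automaton.add_word(city.lower(), city)  # value = canonical city name
automaton.make_automaton()

for sen in sentences:
    found = [city for _end, city in automaton.iter(sen.lower())]
    print(found)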

For your naive Algorithm 1, I'd recommend only generating subphrases of up to M words, where M is the word count of the longest string in your list. (There's no point trying to match a 10-word sub-sentence against strings which are only up to 3 words long, for example.) This should at least help speed things up; a rough sketch follows.
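
Something like this, reusing the question's generate_subphrases name (a hypothetical reimplementation of it, since the original body wasn't shown):

# Rough sketch: cap sub-phrase length at the longest entry (in words)
# in the list, and emit longer sub-phrases first.
max_words = max(len(city.split()) for city in cities)

def generate_subphrases(sen, max_words=max_words):
    tokens = sen.split()
    n = len(tokens)
    return [
        ' '.join(tokens[start:start + length])
        for length in range(min(n, max_words), 0, -1)
        for start in range(n - length + 1)
    ]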

hbejgel
  • Yeah. Limiting length can definitely help much. Thanks for providing the suggestion about `pyahocorasick`. I will check this out. – Ruchit Oct 05 '21 at 15:47
  • Although after looking at the data I have, it turns out that the list contains longer phrases as well, up to a length of 5 to 7 words. This is a list of product names of different companies. – Ruchit Oct 05 '21 at 17:05