Given a list of words and a sentence find all words that appear in the sentence either in whole or as a substring

Question

Problem

Given a list of strings, find the strings from the list that appear in the given text.

Example

list = ['red', 'hello', 'how are you', 'hey', 'deployed']
text = 'hello, This is shared right? how are you doing tonight'
result = ['red', 'how are you', 'hello']

'red' is included because 'shared' contains 'red' as a substring.

This is very similar to this question, except that the word we need to look for can be a substring as well.
The list is pretty large and grows with the number of users, whereas the text stays pretty much the same length throughout.
I was thinking of a solution whose time complexity depends on the length of the text rather than on the list of words, so that it scales even when lots of users are added.

Solution

I build a trie from the given list of words.
Run DFS on the text and check the current word against the trie.

Pseudo-code

def FindWord(trie, text, word_so_far, index):
    # If word_so_far is not a prefix of any key, stop
    if not trie.has_subtrie(word_so_far):
        return
    # If word_so_far is a key, add it to the result and look further
    if trie.has_key(word_so_far):
        result.add(word_so_far)  # result is a set shared across calls
    # Nothing left to consume
    if index >= len(text):
        return
    # Extend the current word we are searching
    FindWord(trie, text, word_so_far + text[index], index + 1)
    # Start a new word from the next index
    FindWord(trie, text, "", index + 1)

The problem with this is that, although the runtime now depends on len(text), it runs with a time complexity of O(2^n). Building the trie is a one-time cost shared across multiple texts, so that part is fine.

I do not see any overlapping subproblems to memoize in order to improve the runtime, either.

Can you suggest a way to achieve a runtime that depends on the given text rather than on the list of words (which can be preprocessed and cached), and that is also faster than this?

What have you tried? Do you have an example of your own working code? — Jordan Singer, Jan 15 '19 at 16:45
I have posted the pseudocode of what I have done. The actual code has a lot of irrelevant parts which may distract from the actual question — thebenman, Jan 15 '19 at 16:52
@JordanSinger OP has already given a reasonably informative description of his attempts, so there is no need to post the actual code, as the actual issue is with the *algorithm*. — meowgoesthedog, Jan 15 '19 at 16:55
How often do you recursively restart from "" at the same index? (or otherwise duplicate a previous starting point with a lot of work to redo) Maybe make a counter based on the tuple key (word_so_far, index) to see if this is happening. — Kenny Ostrom, Jan 15 '19 at 17:26
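The counter suggested in the last comment could be sketched roughly as follows. This is only an illustration under assumptions: `count_states` is a hypothetical helper, and a plain set of prefixes stands in for the trie from the question.

```python
from collections import Counter


def count_states(words, text):
    """Count how often the recursion visits each (word_so_far, index) pair."""
    # Assumption: a set of all prefixes stands in for trie.has_subtrie().
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    visits = Counter()

    def find_word(word_so_far, index):
        visits[(word_so_far, index)] += 1
        if word_so_far not in prefixes:
            return  # not a prefix of any word: prune this branch
        if index >= len(text):
            return  # nothing left to consume
        find_word(word_so_far + text[index], index + 1)  # extend current word
        find_word("", index + 1)                         # restart from ""

    find_word("", 0)
    return visits


visits = count_states(['red', 'hello'], 'shared')
# States reached more than once are repeated work that a memo could skip.
repeated = {state: n for state, n in visits.items() if n > 1}
print(repeated)
```

Every live branch at one index restarts from `""` at the next index, so states of the form `("", index)` tend to be visited repeatedly. Keying a memo on `(word_so_far, index)` would bound the work by the number of distinct states rather than by the number of paths.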

David Eisenstat · Accepted Answer · 2019-01-16T13:01:50.640

The theoretically sound version of what you're trying to do is called Aho-Corasick. Implementing the suffix links is somewhat complicated IIRC, so here's an algorithm that just uses the trie.

We consume the text letter by letter. At all times, we maintain a set of nodes in the trie where the traversal can be. Initially this set consists of just the root node. For each letter, we loop through the nodes in the set, descending via the new letter if possible. If the resulting node is a match, great, report it. Regardless, put it in the next set. The next set also contains the root node, since we can start a new match at any time.

Here's my attempt at a quick implementation in Python (untested, no warranty, etc.).

class Trie:
    def __init__(self):
        self.is_needle = False
        self._children = {}

    def find(self, text):
        node = self
        for c in text:
            node = node._children.get(c)
            if node is None:
                break
        return node

    def insert(self, needle):
        node = self
        for c in needle:
            node = node._children.setdefault(c, Trie())
        node.is_needle = True


def count_matches(needles, text):
    root = Trie()
    for needle in needles:
        root.insert(needle)
    nodes = [root]
    count = 0
    for c in text:
        next_nodes = [root]
        for node in nodes:
            next_node = node.find(c)
            if next_node is not None:
                count += next_node.is_needle
                next_nodes.append(next_node)
        nodes = next_nodes
    return count


print(
    count_matches(['red', 'hello', 'how are you', 'hey', 'deployed'],
                  'hello, This is shared right? how are you doing tonight'))
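The function above only counts matches. To collect the matched words themselves, as the question asks, one option is to store each needle at its terminal node. The following is a minimal sketch of the same set-of-nodes traversal, assuming nested dicts for the trie and the key `None` as an end-of-needle marker; `find_matches` is a name made up for this example.

```python
def find_matches(needles, text):
    # Build a trie of nested dicts. The key None (an assumption of this
    # sketch) marks a node where a needle ends and holds that needle.
    root = {}
    for needle in needles:
        node = root
        for c in needle:
            node = node.setdefault(c, {})
        node[None] = needle

    found = set()
    nodes = [root]  # trie nodes where a partial match currently stands
    for c in text:
        next_nodes = [root]  # a new match can start at any position
        for node in nodes:
            child = node.get(c)
            if child is not None:
                if None in child:
                    found.add(child[None])
                next_nodes.append(child)
        nodes = next_nodes
    return found


matches = find_matches(
    ['red', 'hello', 'how are you', 'hey', 'deployed'],
    'hello, This is shared right? how are you doing tonight')
print(sorted(matches))
```

A set is used for the result so that a needle occurring several times in the text is reported once.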

What would the worst-case time complexity for the trie-based solution be, though? There are `n^2` substrings, and finding each substring is linear time in a trie. So `O(n^3)`? — thebenman, Jan 16 '19 at 17:02
@thebenman You could have a degenerate trie with a, aa, aaa, aaaa, etc., but that would only push it to size of trie times size of text (quadratic). — David Eisenstat, Jan 16 '19 at 18:42

score 1 · Answer 2 · answered Jan 17 '19 at 08:27

Extending @David Eisenstat's suggestion to implement this using the Aho-Corasick algorithm: I found a straightforward Python module (pyahocorasick) that can do this.

Here is how the code would look for the example given in the question.

import ahocorasick

def find_words(list_words, text):
    A = ahocorasick.Automaton()

    for key in list_words:
        A.add_word(key, key)

    A.make_automaton()

    result = []
    for end_index, original_value in A.iter(text):
        result.append(original_value)

    return result

list_words = ['red', 'hello', 'how are you', 'hey', 'deployed']
text = 'hello, This is shared right? how are you doing tonight'
print(find_words(list_words, text))

Or run it online
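For readers who cannot install the module, the automaton itself can be sketched in pure Python. This is a simplified illustration of Aho-Corasick with suffix (failure) links, not the module's implementation; the names `build_automaton` and `find_words_pure` and the list-based node layout are assumptions made for this example.

```python
from collections import deque


def build_automaton(words):
    # Node i has goto[i] (char -> child), fail[i] (suffix link) and
    # out[i] (words that end at this node or at a suffix of it).
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        node = 0
        for c in word:
            if c not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][c] = len(goto) - 1
            node = goto[node][c]
        out[node].append(word)

    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for c, child in goto[node].items():
            f = fail[node]
            while f and c not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(c, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_words_pure(words, text):
    goto, fail, out = build_automaton(words)
    found, node = set(), 0
    for c in text:
        # Follow suffix links until some state can consume c (or we hit root).
        while node and c not in goto[node]:
            node = fail[node]
        node = goto[node].get(c, 0)
        found.update(out[node])
    return found


found = find_words_pure(
    ['red', 'hello', 'how are you', 'hey', 'deployed'],
    'hello, This is shared right? how are you doing tonight')
print(sorted(found))
```

Unlike the set-of-nodes traversal, the scan keeps a single current state, so the number of state transitions is linear in the length of the text once the automaton is built.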

score 0 · Answer 3 · edited Jan 15 '19 at 20:33

If you are aiming for faster code that depends on the text window, you can use set lookups to speed things along. If feasible, change the lookup list into a set, then generate all possible windows in the text to use for lookups.

def getAllWindows(L):
    tracker = set()
    for w in range(1, len(L)+1):
        for i in range(len(L)-w+1):
            sub_window = L[i:i+w]
            if sub_window not in tracker:
                tracker.add(sub_window)
                yield sub_window


lookup_list = ['red', 'hello', 'how are you', 'hey', 'deployed']
lookup_set = set(lookup_list)
text = 'hello, This is shared right? how are you doing tonight'
# Test membership against the set, not the list, so each lookup is O(1) on average
result = [sub_window for sub_window in getAllWindows(text) if sub_window in lookup_set]
print(result)
# Output:
# ['red', 'hello', 'how are you']
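One possible refinement, sketched here as an assumption rather than as part of the answer, is to generate only windows whose width equals the length of some lookup word, since no other window can match. `find_by_windows` is a hypothetical name.

```python
def find_by_windows(lookup_words, text):
    lookup_set = set(lookup_words)
    # Only window widths equal to some lookup word's length can match.
    widths = sorted({len(word) for word in lookup_set if word})
    found = set()
    for width in widths:
        # range() is empty when the width exceeds the text length.
        for start in range(len(text) - width + 1):
            window = text[start:start + width]
            if window in lookup_set:
                found.add(window)
    return found


windows_result = find_by_windows(
    ['red', 'hello', 'how are you', 'hey', 'deployed'],
    'hello, This is shared right? how are you doing tonight')
print(sorted(windows_result))
```

The work here still grows with the number of distinct word lengths in the lookup list, so this trades the quadratic number of windows for a mild dependence on the list.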
