Given a "large" list of patterns and a "short" text, what is the best/fastest way to search/tag those patterns in the text, where we are trying to find the pattern as a substring of the text? If there are multiple matches of a pattern in a text, we want to ideally find all of them.

To be more specific, the texts are actually streaming queries and the patterns to look for are named entities. We need an entire pattern to match in full. Training a NER model to tag entities is not an option. By "large" list, I mean a few hundred thousand entities to look up. By "short" text, I mean an average of 10 words.

e.g. :

Text: the actor who plays the black widow in the avengers.

I am considering tries and FSTs. Trying to understand the pros and cons of both in this particular scenario. Any pointers would be appreciated.

Satarupa Guha

1 Answer

You could take a look at the Aho-Corasick algorithm. This algorithm constructs a finite state machine from all search patterns, basically a trie but with some extra edges. It then uses this trie to search an input string for all search patterns simultaneously. The time complexity is O(n + m + z), where n is the length of the input text, m is the total number of characters in all search patterns, and z is the number of occurrences of search patterns in your input text.

However, this time complexity assumes you build the trie for each search. Since it seems your search patterns do not change, you can build the trie (finite state machine) once up front and keep it in memory; the O(m) construction cost is then paid only once, and I think each query can be searched against the precomputed automaton in O(n + z) going forward, i.e. linear in the query length plus the number of matches reported.
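
To make the "build once, search many times" idea concrete, here is a minimal pure-Python sketch of Aho-Corasick. It is only an illustration, not a tuned implementation: the function names `build_automaton` and `find_all` are made up for this example, and the pattern list is hypothetical (the question does not give one). It assumes non-empty patterns and matches raw substrings, so it does not check word boundaries.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie plus failure links for a list of non-empty strings."""
    goto = [{}]   # goto[state][char] -> next state (the trie edges)
    fail = [0]    # fail[state] -> state for the longest proper suffix that is a trie prefix
    out = [[]]    # out[state] -> all patterns that end when we reach this state

    # 1) Insert every pattern into the trie.
    for pat in patterns:
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(pat)

    # 2) Breadth-first pass to add the "extra edges" (failure links).
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix state as well.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence of every pattern."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


# Hypothetical entity list; built once, then reused for every incoming query.
AUTOMATON = build_automaton(["black widow", "widow", "avengers", "the avengers"])

if __name__ == "__main__":
    query = "the actor who plays the black widow in the avengers"
    for start, pat in find_all(query, AUTOMATON):
        print(start, pat)
```

Overlapping entities are all reported here (both "black widow" and "widow"), so if only whole-word or longest matches are wanted, that would be an extra filtering step on the returned list. For a few hundred thousand entities, an existing compiled implementation is probably a better fit than this sketch.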

Eric
  • Thanks, Eric. Any idea about how this approach compares with using something like WFST? – Satarupa Guha Dec 08 '21 at 08:30
  • I'm not too familiar with WFST, but the major differences seem to be WFST's ability to 1) produce an output while simultaneously pattern searching, and 2) match patterns based on context via probabilities. Aho-Corasick is an FSA-based algorithm, whereas a WFST is a finite state transducer, and as such Aho-Corasick is best suited to searching for literal substrings. If this is all you need, the contextual matching of FST / WFST seems like it might be overkill. If the extra capabilities of WFST are useful, the foundation seems to resemble Aho-Corasick. Hopefully others can weigh in further. – Eric Dec 08 '21 at 16:00