Efficient keyword matching algorithm for large keyword sets

Question

I am working on a problem where I need to efficiently check if any keywords from a large set of keywords exist in a moderate-sized English sentence. The size of the sentence is m, and I have n keywords, with an average keyword size of k.

I have a list of keywords, and the matching rules are as follows:

Exact Match: If a keyword is surrounded by double quotes (""), it is considered an exact match. In this case, the sentence must contain the entire keyword phrase within the quotes, in the exact order. For example, the string "working from my office home" would match the exact keyword "office home", but not "home office".

Partial Match: If a keyword is without double quotes, it is considered a partial match. The order of the words is not important, and each word in the keyword only has to appear as part of some word in the sentence. For example, the string "working from my home office" would match a partial keyword like "home office", and the keyword off is a partial match of "office".

Combination of Exact and Partial Matches: There can be a combination of exact and partial matches within the same keyword. For instance, the keyword "tic toe" tac would match the sentence tic toe tac but not tic tac toe. In this case, "tic toe" is considered an exact match and tac a partial match, and both must be satisfied.

I'm aware of the naive approach, which has a time complexity of O(n*m) since each contains operation costs O(m) and we have n keywords to check.
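For reference, the naive approach under the rules above might look like the sketch below. It assumes words are separated by whitespace, that a quoted phrase must appear as consecutive whole words, and that an unquoted word only has to be a substring of some word in the sentence. The names `parse_keyword`, `matches`, and `naive_search` are hypothetical helpers introduced for illustration.

```python
import re

def parse_keyword(keyword):
    """Split a keyword into (exact_phrases, partial_words).

    Quoted parts become tuples of whole words; everything else is
    treated as a list of partial-match words.
    """
    exact = [tuple(p.split()) for p in re.findall(r'"([^"]*)"', keyword)]
    partial = re.sub(r'"[^"]*"', " ", keyword).split()
    return exact, partial

def contains_phrase(words, phrase):
    """True if `phrase` occurs in `words` as consecutive whole words."""
    n = len(phrase)
    return any(tuple(words[i:i + n]) == phrase
               for i in range(len(words) - n + 1))

def matches(keyword, sentence):
    """A keyword matches only if every exact phrase and every partial word matches."""
    words = sentence.split()
    exact, partial = parse_keyword(keyword)
    return (all(contains_phrase(words, p) for p in exact)
            and all(any(w in word for word in words) for w in partial))

def naive_search(keywords, sentence):
    """Check every keyword against the sentence: roughly O(n * m) overall."""
    return [kw for kw in keywords if matches(kw, sentence)]
```

Every keyword is tested against the whole sentence, which is where the factor of n comes from.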

I've also considered using a Trie, which would reduce the complexity to O(m*k) since we check each character in the string against the Trie. Since the average size of the keyword is k, this approach seems more efficient.
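A minimal sketch of that Trie scan, assuming plain keywords that should be found as substrings (it does not handle the quoted or combined rules). `build_trie` and `trie_scan` are hypothetical names, and the trie is a nested dict with `None` used as the end-of-keyword marker.

```python
END = None  # marker key; never collides with a real character

def build_trie(keywords):
    """Build a character trie as nested dicts."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node[END] = kw  # remember which keyword ends here
    return root

def trie_scan(root, text):
    """Walk the trie from every start position in the text.

    Each walk stops as soon as no keyword continues, so it is bounded by
    the longest keyword, giving roughly O(m * k) and no factor of n.
    """
    hits = set()
    for start in range(len(text)):
        node = root
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if END in node:
                hits.add(node[END])
    return hits
```

For the string "abcde" and the keywords "ab", "bc", and "de", the scan restarts at each of the five positions and follows at most two characters each time.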

I recently came across the Aho-Corasick algorithm, which I believe provides similar functionality. Could someone explain the differences between using a Trie and the Aho-Corasick algorithm for keyword matching? Are there any other algorithms that I should be aware of for this problem? What are the pros and cons of considering other data structures and the naive approach?

Additionally, I should mention that the number of keywords is quite large, while the size of the string is relatively smaller compared to the number of keywords. Given these constraints, which algorithm would be the most suitable for efficient keyword matching in this scenario?

In a Trie, the time complexity to find all the keywords is still O(km), right? Let's consider the string "abcde" and the keywords "ab," "bc," and "de." First, I check if the string is a prefix of any keyword, which takes O(m) time. Only the prefixes "a," "b," and "d" yield results. Then, I continue traversing the Trie until I find the keywords "ab," "bc," and "de." So, the overall time complexity is O(km).

Also consider: You have to read each *character* in the input exactly once. With the trie, you only compare each input character with a single character in the trie. Ergo, the trie is O(n), which is the minimum. — Mooing Duck, Jul 07 '23 at 21:00
HashSet is not a feasible solution because the processing time of O(n*m) is too high for me. I need to reduce it to nearly linear time. The sizes are such that K < m < n, where n is significantly larger than m (e.g., 100K keywords). — Zvi Mints, Jul 07 '23 at 21:09
But in a Trie, the time complexity to find all the keywords is still O(km), right? Let's consider the string "abcde" and the keywords "ab," "bc," and "de." First, I check if the string is a prefix of any keyword, which takes O(m) time. Only the prefixes "a," "b," and "d" yield results. Then, I continue traversing the Trie until I find the keywords "ab," "bc," and "de." So, the overall time complexity is O(km), isn't it? — Zvi Mints, Jul 07 '23 at 21:11
All your keywords are complete words? Keyword has a match in text only if this match is a complete word too, surrounded by non-alphabetic characters? — maxplus, Jul 07 '23 at 21:13
To better understand my needs, I will explain in detail, although I intended to keep it brief. — Zvi Mints, Jul 07 '23 at 21:22
I have a list of keywords, such as "home," "office," "home office," or "office home." If a keyword is surrounded by double quotes (""), it is considered an exact match. In this case, the string must contain the word within the quotes, such as "working from my office home." On the other hand, if a keyword is without double quotes (partial match), the order of the words is not important and it can contain variations of the word. For example, the string "working from my home office" would match a partial keyword like "home office." — Zvi Mints, Jul 07 '23 at 21:22
The keyword `off` is a partial match of `office`, but the keyword `"off"` is not a match for `office`; it would match `off white`. — Zvi Mints, Jul 07 '23 at 21:23
This piece of information is very important, thanks for mentioning it. How about keyword `wife` and sentence `wives`, you don't expect the algorithm to find a match, right? Whereas keyword `ive` should match. — maxplus, Jul 07 '23 at 21:25
Correct, the keyword `ive` matches `wives`, but `"ive"` does not, because it is not a "full" word in the sentence. — Zvi Mints, Jul 07 '23 at 21:29
Before finding a solution, I'm trying to understand the complexities of the naive approach, a Trie, and other possible solutions (like Aho-Corasick), along with their motivation, pros, and cons. — Zvi Mints, Jul 07 '23 at 21:29
Your confusion is thinking of it as "The size of the sentence is m, and I have n keywords, with an average keyword size of k." Note that if the average keyword size is k, then the average *input* word size is probably also K. If so, then O(mk) *is* linear time, relative to the characters in the input file. Hashset and trie are *both* O(mk), and therefore linear time, and nothing is going to be algorithmically faster. — Mooing Duck, Jul 07 '23 at 21:54
"the order of the words is not important and it can contain variations of the word" -- so keyword `tic toe` matches sentence `toes ticking` but not `tic tac toe`? — maxplus, Jul 07 '23 at 22:06
@MooingDuck if you check the details Zvi provided, neither a trie nor a hashset is able to solve this problem in `O((m+n)k)` — maxplus, Jul 07 '23 at 22:09
`tic toe` matches the sentence `toes ticking` and *ALSO* `tic tac toe`, since both `tic` and `toe` appear in `tic tac toe`. If it were `"tic toe"`, it would match only `tic toe tac` and not `tic tac toe`. To simplify: for an exact match, the keyword must be in the sentence in that order; for a partial match, each word in the keyword must be part of the sentence. — Zvi Mints, Jul 07 '23 at 22:26
@MooingDuck How is a Hashset O(mk)? I need to traverse all the keywords, which is already `n`, and then for each keyword check whether it exists in the sentence with the logic I mentioned to @maxplus, which takes at least O(`m`) — Zvi Mints, Jul 07 '23 at 22:28
@ZviMints, please add all this logic to the body of the question. Different answers to my questions would mean different problems with different solutions, the most straightforward interpretation of what you wrote in your question would be better approached using trie/hashset with complexity `O((m+n)k)`, but that solution is irrelevant for your actual problem. — maxplus, Jul 07 '23 at 22:41
Two other details. The algorithm should return only `true`/`false`, not list all matching keywords, correct? Do you indeed intend to search only through a single text? — maxplus, Jul 07 '23 at 22:43
@ZviMints: I had assumed that partial matches weren't allowed, but upon rereading the comments, I now see that partial matches are allowed. (keyword `off` matching sentence `office`) You're right that hashmaps and tries don't quite cover that. — Mooing Duck, Jul 08 '23 at 04:29
I need to get all the keywords which are matched, I will update the question shortly, thanks! — Zvi Mints, Jul 08 '23 at 05:39
I still have no idea how this is supposed to be answered. And you haven't answered my last question. Aho-Corasick algorithm can be modified to solve this problem in linear time, trie can not. But apparently you haven't read anything about how standard Aho-Corasick works and don't understand very well how trie works, which is a pre-requisite for standard Aho-Corasick, so posting an answer about how to modify Aho-Corasick would be pointless. — maxplus, Jul 08 '23 at 07:35
You started asking about details; I just wanted to understand the motivation for the naive approach vs. a Trie vs. Aho-Corasick, since most of what I read does not take the characters into consideration. For example, a Trie is O(m*k) from my understanding, while Aho-Corasick is O(m+n+z), and n in my example is much higher than k: the average keyword length is about 15 characters, but n is about 100K — Zvi Mints, Jul 08 '23 at 07:41
If you can provide an explanation of the motivations and the pros vs. cons, that would be great — Zvi Mints, Jul 08 '23 at 07:41

score 1 · Answer 1 · answered Jul 09 '23 at 05:55

https://stackoverflow.com/a/21128777/56778 might be of some interest.

I would suggest the Aho-Corasick algorithm. It matches partial words, but it's easy enough to post-process the output to filter out the partial matches. With 100K search terms, I wouldn't recommend trying to build a regex for it. The algorithm is simple enough to implement if you study the original paper, which is available at https://dl.acm.org/doi/pdf/10.1145/360825.360855 (PDF).

Note that the algorithm's complexity as described in the paper is linear in the length of the search strings plus the length of the searched text plus the number of output matches. I think you'd be hard pressed to find a more efficient solution.
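To make that concrete, here is a compact sketch of the textbook goto/failure construction. It is an illustration under stated assumptions rather than a tuned implementation: `build_automaton` and `search` are hypothetical names, states are list indices, and the transitions are plain dicts.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto, failure, and output tables for a set of patterns."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure state.
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out

def search(automaton, text):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

Each input character is consumed once and the failure links replace the restarts a plain trie scan would need. The start indices in the result are what a post-processing step would use to check word boundaries.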

The algorithm does have pathological cases that will produce a quadratic number of matches. You should be aware of those, but in searching text for English (or other language) words, you shouldn't encounter those cases. See the paper for details on those pathological cases.

You might also consider not trying to "roll your own." It's quite possible that the standard GNU fgrep program will do what you want. I don't know how it would handle 100K search strings, but it'd be simple for you to find out. Again, you'd have to post-process the output to eliminate erroneous partial matches, but that wouldn't be any more difficult than doing it in your custom program.

n. m. could be an AI · Answer 2 · 2023-07-09T08:41:08.510

1

Consider a hybrid approach. You may find exact matches and partial matches using separate algorithms and data structures.

Consider three different match types: an exact word match, a partial word match, and an exact phrase match.

Exact word matches are the easiest: just use a hash table and look up each word separately. That's O(k) per word or O(m) for the entire sentence.

Partial word matches can be tackled with a trie. You will need to run each suffix of each word through the trie to find matches in the middle of the word. That's O(k^2) per word. Since there are O(m/k) words, the overall complexity is O(mk).

Finally exact phrase matches can be solved with another trie, with words (as opposed to characters) as nodes. Here you run each (word-level) suffix of the sentence through the trie. Since there are O(m/k) words, there are O(m/k) suffixes of length O(m/k) each, so the overall complexity in this case is O((m/k)^2).
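A minimal sketch of such a word-level trie, assuming whitespace-separated words. `build_phrase_trie` and `find_phrases` are hypothetical names, and `None` marks the end of a phrase.

```python
def build_phrase_trie(phrases):
    """Trie whose edges are whole words rather than characters."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[None] = phrase  # end-of-phrase marker
    return root

def find_phrases(root, sentence):
    """Run every word-level suffix of the sentence through the trie."""
    words = sentence.split()
    hits = set()
    for start in range(len(words)):
        node = root
        for i in range(start, len(words)):
            node = node.get(words[i])
            if node is None:
                break  # no phrase continues with this word
            if None in node:
                hits.add(node[None])
    return hits
```

In practice each walk stops once no stored phrase continues, so the quadratic bound is only reached when the phrases are about as long as the sentence.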

Combination matches can be found by just breaking up combination keywords into exact phrases and partial-match words, for example the combination keyword "tic toe" tac is equivalent (?) to having a "tic toe tac" phrase keyword and a tac partial-match keyword.
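One way to recombine the pieces is sketched below, under the assumption (taken from the asker's comments rather than from this answer) that a combination keyword matches only when every one of its components matches. `decompose`, `build_index`, and `combine` are hypothetical names, and `found_components` stands for whatever the phrase and partial-word searches reported.

```python
import re
from collections import defaultdict

def decompose(keyword):
    """Break a keyword into ("exact", phrase) and ("partial", word) components."""
    phrases = re.findall(r'"([^"]*)"', keyword)
    words = re.sub(r'"[^"]*"', " ", keyword).split()
    return ({("exact", " ".join(p.split())) for p in phrases}
            | {("partial", w) for w in words})

def build_index(keywords):
    """Map each component to the keywords that need it (one-time cost)."""
    needed, by_component = {}, defaultdict(list)
    for kw in dict.fromkeys(keywords):  # drop duplicate keywords
        comps = decompose(kw)
        needed[kw] = len(comps)
        for c in comps:
            by_component[c].append(kw)
    return needed, by_component

def combine(index, found_components):
    """Return keywords whose components were all found in the sentence."""
    needed, by_component = index
    seen = defaultdict(int)
    for c in set(found_components):
        for kw in by_component.get(c, ()):
            seen[kw] += 1
    return {kw for kw, count in seen.items() if count == needed[kw]}
```

Only keywords that share at least one found component are touched, so this step scales with the number of hits rather than with n.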

Note that the size of the dictionary n does not affect any of the complexities.

edited Jul 09 '23 at 08:41

answered Jul 09 '23 at 06:59

n. m. could be an AI


I think I have a different use-case. I get a sentence, e.g. `going to watch a movie`, and I need to find all the matching keywords. If the keywords are `["watch", "to watch", ing, movie, soccer]`, the hits are `["watch", "to watch", ing, movie]`. I'm not sure how this does not depend on the number of keywords – Zvi Mints Jul 09 '23 at 08:01
Let's go through the three parts of the algorithm. First you separate your sentence into words (that would be `going` `to` `watch` `a` `movie`) and search for them in the hash table of single-word keywords (that would be `watch` `ing` `movie`). You get two matches, `watch` and `movie`. Each search in a hash table is O(k) amortised. – n. m. could be an AI Jul 09 '23 at 08:11
Second you form all word-level suffixes, that would be `going` `oing` `ing` `ng` `g` `to` `o` `watch` `atch` etc, and run them through the trie of individual words. You will find all the matches you found in the first step, plus one new match `ing`. (Thus the first step is redundant, but you may still want to perform it if you want to get a partial list of matches out quickly before outputting a full list). Each suffix takes time proportional to its size, no dependency on the size of the trie. – n. m. could be an AI Jul 09 '23 at 08:15
Thanks, but I think it's not suitable for my use-case, since e.g. `"to watch" soccer` may also be a keyword, and it's not a match: even if `"to watch"` exists, `soccer` is not a partial match. So putting them into a dictionary is not a suitable solution, is it? – Zvi Mints Jul 09 '23 at 08:16
Also, given `n` keywords, constructing a suffix tree for each keyword is `O(k)`, where k is the average keyword length, right? (That is, if my use-case were only partial matching, without the combination of partial and exact.) – Zvi Mints Jul 09 '23 at 08:20
Finally you form a list of word-level suffixes of the sentence, that would be `going, to, watch, a, movie`, `to, watch, a, movie` `watch, a, movie` `a, movie` `movie`, and run it through a trie of phrases. This trie contains just two nodes in your example, `to -> watch` and the suffix `to, watch, a, movie` produces a hit. Again, no dependency on the size of the trie. – n. m. could be an AI Jul 09 '23 at 08:21
And just to understand: the first approach was to find *exact* matches (take the text, split it into words, search the dictionary), which is `O(n)` where n is the number of words in the keywords? The second approach was to create a suffix tree for each keyword as described above. The issue is that it's heavy to create and also doesn't support exact + partial matches, right? – Zvi Mints Jul 09 '23 at 08:21
`"to watch" soccer` may be also a word --- I don't quite understand what this means. – n. m. could be an AI Jul 09 '23 at 08:22
My "dictionary" is the set of keywords. Each keyword may contain exact matches (phrases starting and ending with `"`) and partial matches, which are just words. A keyword matches when all of its partial rules and exact rules are met, i.e. when all the partial words in the keyword exist in the text (a partial word may also be e.g. `ing`) and all the exact phrases are also contained in the text – Zvi Mints Jul 09 '23 at 08:24
Creating a trie is proportional to the size of the trie, but you are supposed to use the same trie for many queries so this is a one-time cost. The second part supports partial matches, why do you think it doesn't? A trie can find prefixes and you form a list of suffixes. A prefix of a suffix is an arbitrary substring. – n. m. could be an AI Jul 09 '23 at 08:24
for e.g, the sentece `cnn 2023 01 11 politics republican irs funding 87000 agents` – Zvi Mints Jul 09 '23 at 08:24
will create match for ```List(politics, "politics republican", public, pub, funding irs, irs funding, cnn "republican irs", "republican irs" cnn public "politics republican", "2023" "republican irs")``` – Zvi Mints Jul 09 '23 at 08:25
and will not create match for ```List("agents irs", political, "pub", bbc "republican irs")``` – Zvi Mints Jul 09 '23 at 08:25
Please give coherent examples. If your dictionary contains quoted phrases or unquoted (partial) words, then `"to watch" soccer` cannot be in a dictionary because it is neither. – n. m. could be an AI Jul 09 '23 at 08:26
I provided full example for my use-case, hope you find it good enough – Zvi Mints Jul 09 '23 at 08:27
Each element in the List is a **keyword** – Zvi Mints Jul 09 '23 at 08:27
Your example in the main text contains `the keyword "tic toe" tac` but I don't understand what exactly it would match. Is it equivalent to having three separate keywords `"tic toe tac"` and `tac`? Or is it equivalent to having three separate keywords `"tic toe tac"`, `"tic toe"` and `tac`? Either way you just break your combination keyword to a set of simple exact match and partial match keywords. – n. m. could be an AI Jul 09 '23 at 08:34
The keyword `"tic tac" toe` should match a sentence which has `tic tac` in that specific order, e.g. `i love to play tic tac` but not `i love to play tac tic`, and which **also** contains the word `toe` somewhere in the text. For example, `i love play toe to tic tac` is a match. If either `toe` or `"tic tac"` does not exist in the sentence, it's not considered a match – Zvi Mints Jul 09 '23 at 08:37
"and also should contain the word toe somewhere in the text" Huh? `i love to play tic tac` does not contain `toe` anywhere. Should it match or not? – n. m. could be an AI Jul 09 '23 at 08:45
At any rate I recommend to rewrite your question and include all the information you put in the comments, with a lot more examples than you currently have. Perhaps even ask a new question. You may or may not have a much harder problem than what is presented in the question as it stands. – n. m. could be an AI Jul 09 '23 at 08:47
