I am working on a problem where I need to efficiently check if any keywords from a large set of keywords exist in a moderate-sized English sentence. The size of the sentence is m, and I have n keywords, with an average keyword size of k.
I have a list of keywords, and the matching rules are as follows:
Exact Match: If a keyword is surrounded by double quotes (""), it is considered an exact match. In this case, the sentence must contain the entire keyword phrase within the quotes, in the exact order. For example, the string "working from my office home" would match the exact keyword "office home", but not "home office".
Partial Match: If a keyword has no double quotes, it is considered a partial match. The order of the words is not important, and each word in the keyword only has to appear as part of the sentence, not as a whole word. For example, the string "working from my home office" would match a partial keyword like "home office" (or "office home"), and the keyword off is a partial match for "office".
Combination of Exact and Partial Matches: A single keyword can combine exact and partial parts. For instance, the keyword "tic toe" tac would match the sentence tic toe tac but not tic tac toe. In this case, "tic toe" is treated as an exact match and tac as a partial match.
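To make the three rules concrete, here is a minimal sketch of how I understand them. The function names parse_keyword and matches are just illustrative names I made up. I am assuming case-sensitive matching and treating an exact phrase as a plain substring, since the rules above do not say anything about case or word boundaries.

```python
import re

def parse_keyword(keyword):
    """Split one keyword into (exact_phrases, partial_words)."""
    # Everything between a pair of double quotes is an exact phrase.
    exact = re.findall(r'"([^"]*)"', keyword)
    # Whatever is left outside the quotes is a list of partial words.
    remainder = re.sub(r'"[^"]*"', " ", keyword)
    return exact, remainder.split()

def matches(keyword, sentence):
    """True if every exact phrase and every partial word occurs in the sentence."""
    exact, partial = parse_keyword(keyword)
    # Assumption: both kinds are checked with plain substring containment;
    # exact phrases keep their word order, partial words may appear anywhere.
    return all(p in sentence for p in exact) and all(w in sentence for w in partial)

print(matches('"office home"', "working from my office home"))  # True
print(matches('"home office"', "working from my office home"))  # False
print(matches('"tic toe" tac', "tic tac toe"))                  # False
```

This checks one keyword at a time, so on its own it is still the naive approach; it only pins down what "match" means.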
I'm aware of the naive approach, which has a time complexity of O(n*m), assuming each contains operation costs O(m), since we have n keywords to check.
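For reference, the naive approach I have in mind looks roughly like this. It is a sketch that treats every keyword as one plain substring and ignores the quote rules; naive_find is a made-up name, and the actual cost of each `in` check depends on the language implementation, which I am treating as roughly O(m).

```python
def naive_find(sentence, keywords):
    """Return every keyword that occurs as a substring of the sentence."""
    # One substring search per keyword: n searches over a sentence of length m.
    return [kw for kw in keywords if kw in sentence]

print(naive_find("working from my home office", ["home", "off", "garden"]))
# ['home', 'off']
```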
I've also considered using a Trie built from the keywords, which would reduce the complexity to O(m*k): from each of the m positions in the string we walk down the Trie, and since the average size of a keyword is k, each walk takes about k steps. Because this no longer depends on n, this approach seems more efficient.
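The Trie idea, as I picture it, is sketched below. It uses nested dicts as Trie nodes, with None as an end-of-keyword marker; both are my own choices for illustration, as are the names build_trie and trie_find. Like the naive version, it only handles plain substrings, not the quote rules.

```python
END = None  # marker key; cannot collide with a one-character string key

def build_trie(keywords):
    """Build a Trie of nested dicts, one level per character."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node[END] = kw  # remember which keyword ends at this node
    return root

def trie_find(sentence, root):
    """Return the set of keywords that occur anywhere in the sentence."""
    found = set()
    for start in range(len(sentence)):      # try every start position
        node = root
        i = start
        while i < len(sentence) and sentence[i] in node:
            node = node[sentence[i]]        # follow one Trie edge
            if END in node:
                found.add(node[END])
            i += 1
    return found

trie = build_trie(["home", "off", "garden"])
print(sorted(trie_find("working from my home office", trie)))  # ['home', 'off']
```

Each walk stops as soon as no keyword continues with the next character, so a walk is never longer than the longest keyword.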
I recently came across the Aho-Corasick algorithm, which I believe provides similar functionality. Could someone explain the differences between using a Trie and the Aho-Corasick algorithm for keyword matching? Are there any other algorithms that I should be aware of for this problem? What are the pros and cons of considering other data structures and the naive approach?
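As far as I understand it, Aho-Corasick is a Trie with extra "failure links", so that the scan never restarts from an earlier position in the string. Below is a minimal sketch of my reading of the algorithm, not a reference implementation; the names build_automaton and find_all are made up, and keywords are assumed to be non-empty plain substrings.

```python
from collections import deque

def build_automaton(keywords):
    """Build goto/fail/output tables for a set of non-empty keywords."""
    goto, fail, out = [{}], [0], [[]]       # state 0 is the root
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(kw)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(sentence, automaton):
    """Scan the sentence once and return the set of keywords found."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in sentence:
        while state and ch not in goto[state]:
            state = fail[state]             # fall back instead of restarting
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return found

automaton = build_automaton(["he", "she", "his", "hers"])
print(sorted(find_all("ushers", automaton)))  # ['he', 'hers', 'she']
```

If I read it correctly, the difference from the plain Trie scan is that each character of the sentence is consumed once, with the failure links doing the work of the repeated restarts.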
Additionally, I should mention that the number of keywords is quite large, while the string is short relative to the number of keywords. Given these constraints, which algorithm would be the most suitable for efficient keyword matching in this scenario?
In a Trie, the time complexity to find all the keywords is still O(km), right? Let's consider the string "abcde" and the keywords "ab," "bc," and "de." First, for each position in the string, I check whether the character there starts a prefix of any keyword, which takes O(m) time in total. Only the positions starting with "a," "b," and "d" yield results. From each of those, I continue traversing the Trie, for at most about k more characters, until I find the keywords "ab," "bc," and "de." So, the overall time complexity is O(km).
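To check my reasoning on this example, here is a small sketch that counts how many Trie lookups the scan performs. The function name count_trie_steps and the dict-based Trie are my own choices for illustration.

```python
def count_trie_steps(sentence, keywords):
    """Scan the sentence with a Trie and count single-character lookups."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node[None] = kw                     # None marks the end of a keyword
    found, lookups = [], 0
    for start in range(len(sentence)):
        node = root
        for i in range(start, len(sentence)):
            lookups += 1                    # one Trie lookup per character tried
            node = node.get(sentence[i])
            if node is None:                # no keyword continues this way
                break
            if None in node:
                found.append(node[None])
    return found, lookups

print(count_trie_steps("abcde", ["ab", "bc", "de"]))
# (['ab', 'bc', 'de'], 10)
```

Here m = 5 and k = 2, and the scan does 10 lookups, which is consistent with the O(km) estimate: every start position costs at most one lookup more than the longest keyword.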