
Assume that we have a list of all the words in the English dictionary, A-Z.

I have three kinds of queries to perform on this list of words:

1) find all the words that "start with" a particular fragment

e.g.: if my fragment is 'car', the word 'card' should be returned

2) find all the words that "contain" the fragment as a substring

e.g.: if my fragment is 'ace', the word 'facebook' should be returned

3) find all the words that "end with" a particular fragment

e.g.: if my fragment is 'age', the word 'image' should be returned

After some searching on the internet, I found that 1) can be done with a trie/compressed trie and 3) can be done with a suffix tree.

I am unsure how 2) can be achieved. Also, is there a better approach by which all three cases can be handled? Maintaining both a prefix tree and a suffix tree could be memory-intensive.

Kindly let me know of any other areas to look into.

Thanks in advance.

PS: I will be using C++ to achieve this

EDIT 1: For the time being, I have constructed a suffix tree with immense help from here:

Single Word Suffix Tree Generation in C Language

Here, I need to construct a suffix tree covering the entire set of English dictionary words. So should I:

a) create a separate suffix tree for each word, OR

b) create a generalized suffix tree for all words?

In case a), I am not sure how to keep track of the individual trees for each word while doing substring matching.

Any pointers?

D3XT3R
  • 1) and 3) are subcases of 2) (that is, a prefix and a suffix are both substrings). So all you really need to solve is 2), which is of course the hardest one :-) – Cameron Mar 27 '15 at 21:05

2 Answers


As I pointed out in a comment, the prefix and suffix cases are covered by the general substring case (#2). All prefixes and suffixes are by definition substrings as well. So all we have to solve is the general substring problem.

Since you have a static dictionary, you can preprocess it relatively easily into a form that is fast to query for substrings. You could do this with a suffix tree, but it's far easier to construct and deal with simple sorted flat vectors of data, so that's what I'll describe here.

The end goal, then, is to have a list of sub-words that are sorted so that a binary search can be done to find a match.

First, observe that in order to find the longest substrings that match the query fragment it is not necessary to list all possible substrings of each word, but merely all possible suffixes; this is because every substring can be thought of as a prefix of some suffix. For example, "ees" occurs in "cheese" precisely because "ees" is a prefix of the suffix "eese". (Got that? It's a little mind-bending the first time you encounter it, but simple in the end and very useful.)

So, if you generate all the suffixes of each dictionary word, then sort them all, you have enough to find any specific substring in any of the dictionary words: Do a binary search on the suffixes to find the lower bound (std::lower_bound) -- the first suffix that starts with the query fragment. Then find the upper bound (std::upper_bound) -- one past the last suffix that starts with the query fragment. (For both searches, the comparator should compare no more than fragment-length characters, so that any suffix beginning with the fragment compares as equal to it.) All of the suffixes in the range [lower, upper) must start with the query fragment, and therefore all of the words that those suffixes originally came from contain the query fragment.

Now, obviously actually spelling out all the suffixes would take an awful lot of memory -- but you don't need to. A suffix can be thought of as merely an index into a word -- the offset at which the suffix begins. So only a single pair of integers is required for each possible suffix: one for the (original) word index, and one for the index of the suffix in that word. (You can pack these two together cleverly depending on the size of your dictionary for even greater space savings.)
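
For example, here's one purely illustrative packing (my own sketch, not a requirement; it assumes fewer than 2^24 words, each shorter than 256 characters):

    #include <cstdint>

    // Illustrative packing: valid only if the dictionary has fewer than 2^24
    // words and no word is 256 or more characters long.
    struct PackedSuffixRef {
        uint32_t bits;  // high 24 bits: word index, low 8 bits: suffix offset

        static PackedSuffixRef make(uint32_t word, uint32_t offset) {
            return {(word << 8) | offset};
        }
        uint32_t word()   const { return bits >> 8; }
        uint32_t offset() const { return bits & 0xFFu; }
    };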

To sum up, all you need to do is:

  1. Generate an array of all the word-suffix index pairs for all the words.
  2. Sort these according to their semantic meaning as suffixes (not their numerical value). I suggest std::stable_sort with a custom comparator. This is the longest step, but it can be done once, offline, since your dictionary is static.
  3. For a given query fragment, find the lower and upper bounds in the sorted suffix indices. Each suffix in this range corresponds to a matching substring (of the length of the query, starting at the suffix index in the word at the word index). Note that some words may match more than once, and that matches may even overlap.
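
A minimal C++ sketch of those three steps might look like the following (the names SuffixRef, buildSuffixIndex, and findRange are mine, purely for illustration; here the two indices are kept as plain 32-bit fields rather than packed):

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <string>
    #include <utility>
    #include <vector>

    // One entry per suffix: which word it came from and where the suffix starts.
    struct SuffixRef {
        uint32_t word;    // index into the dictionary
        uint32_t offset;  // index of the suffix's first character in that word
    };

    // The actual text a SuffixRef denotes (valid while `dict` is alive).
    static const char* suffixText(const std::vector<std::string>& dict, SuffixRef r) {
        return dict[r.word].c_str() + r.offset;
    }

    // Steps 1 and 2: enumerate every suffix of every word, then sort the
    // references by the suffix text they denote. Done once, offline.
    std::vector<SuffixRef> buildSuffixIndex(const std::vector<std::string>& dict) {
        std::vector<SuffixRef> refs;
        for (uint32_t w = 0; w < dict.size(); ++w)
            for (uint32_t off = 0; off < dict[w].size(); ++off)
                refs.push_back({w, off});
        std::stable_sort(refs.begin(), refs.end(), [&](SuffixRef a, SuffixRef b) {
            return std::strcmp(suffixText(dict, a), suffixText(dict, b)) < 0;
        });
        return refs;
    }

    // Step 3: the half-open range [first, last) of positions in `refs` whose
    // suffixes start with `frag`. Both comparators look at no more than
    // frag.size() characters, so any suffix beginning with `frag` compares
    // as equal to it.
    std::pair<size_t, size_t> findRange(const std::vector<std::string>& dict,
                                        const std::vector<SuffixRef>& refs,
                                        const std::string& frag) {
        auto lo = std::lower_bound(refs.begin(), refs.end(), frag,
            [&](SuffixRef a, const std::string& f) {
                return std::strncmp(suffixText(dict, a), f.c_str(), f.size()) < 0;
            });
        auto hi = std::upper_bound(refs.begin(), refs.end(), frag,
            [&](const std::string& f, SuffixRef a) {
                return std::strncmp(f.c_str(), suffixText(dict, a), f.size()) < 0;
            });
        return {size_t(lo - refs.begin()), size_t(hi - refs.begin())};
    }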

To clarify, here's a minuscule example for a dictionary composed of the words "skunk" and "cheese".

The suffixes for "skunk" are "skunk", "kunk", "unk", "nk", and "k". Expressed as indices, they are 0, 1, 2, 3, 4. The suffixes for "cheese" are "cheese", "heese", "eese", "ese", "se", and "e". The indices are 0, 1, 2, 3, 4, 5.

Since "skunk" is the first word in our very limited imaginary dictionary, we'll assign it index 0. "cheese" is at index 1. So the final suffixes are: 0:0, 0:1, 0:2, 0:3, 0:4, 1:0, 1:1, 1:2, 1:3, 1:4, 1:5.

Sorting these suffixes yields the following suffix dictionary (I added the actual corresponding textual substrings for illustration only):

0  | 1:0 | cheese
1  | 1:5 | e
2  | 1:2 | eese
3  | 1:3 | ese
4  | 1:1 | heese
5  | 0:4 | k
6  | 0:1 | kunk
7  | 0:3 | nk
8  | 1:4 | se
9  | 0:0 | skunk
10 | 0:2 | unk

Consider the query fragment "e". The lower bound is 1, since "e" (at index 1) is the first suffix that is greater than or equal to the query "e". The upper bound is 4, since "heese" (at index 4) is the first suffix that does not start with "e" (recall that comparisons look at no more than query-length characters). So the suffixes at 1, 2, and 3 all start with the query, and therefore the words they came from all contain the query as a substring (at the suffix index, for the length of the query). In this case, all three of these suffixes belong to "cheese", at different offsets.

Note that for a query fragment that's not a substring of any of the words (e.g. "a" in this example), there are no matches; in such a case, the lower and upper bounds will be equal.
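
Tying this to the sketch above (again, buildSuffixIndex and findRange are just my illustrative names), the two-word dictionary behaves like this:

    #include <iostream>

    int main() {
        std::vector<std::string> dict = {"skunk", "cheese"};
        std::vector<SuffixRef> refs = buildSuffixIndex(dict);

        auto [lo, hi] = findRange(dict, refs, "e");   // lo == 1, hi == 4
        for (size_t i = lo; i < hi; ++i)
            std::cout << dict[refs[i].word] << " @ " << refs[i].offset << '\n';
        // prints: cheese @ 5, cheese @ 2, cheese @ 3

        auto [lo2, hi2] = findRange(dict, refs, "a"); // no match: empty range
        std::cout << (lo2 == hi2 ? "no match" : "match") << '\n';
    }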

Cameron

You can try the Aho-Corasick algorithm. The automaton is in fact a trie, and the failure function is computed by a breadth-first traversal of that trie.
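
For reference, a minimal sketch of that construction might look like this (my own illustrative code, restricted to lowercase a-z): the trie is built from the patterns, then a breadth-first pass fills in the failure links. As the comments below point out, though, this matches many patterns against one text, which is the inverse of the dictionary problem as asked.

    #include <array>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    struct AhoCorasick {
        struct Node {
            std::array<int, 26> next;  // trie / goto edges, -1 if absent
            int fail = 0;              // failure link
            std::vector<int> out;      // ids of patterns ending here
            Node() { next.fill(-1); }
        };
        std::vector<Node> nodes = std::vector<Node>(1);  // node 0 is the root

        void addPattern(const std::string& p, int id) {
            int cur = 0;
            for (char c : p) {
                int idx = c - 'a';
                if (nodes[cur].next[idx] == -1) {
                    nodes[cur].next[idx] = (int)nodes.size();
                    nodes.emplace_back();
                }
                cur = nodes[cur].next[idx];
            }
            nodes[cur].out.push_back(id);
        }

        // Failure function: a breadth-first traversal of the trie. Missing
        // edges are converted into goto edges so the automaton never blocks.
        void build() {
            std::queue<int> q;
            for (int c = 0; c < 26; ++c) {
                int v = nodes[0].next[c];
                if (v == -1) nodes[0].next[c] = 0;
                else { nodes[v].fail = 0; q.push(v); }
            }
            while (!q.empty()) {
                int u = q.front(); q.pop();
                for (int c = 0; c < 26; ++c) {
                    int v = nodes[u].next[c];
                    int f = nodes[nodes[u].fail].next[c];
                    if (v == -1) { nodes[u].next[c] = f; continue; }
                    nodes[v].fail = f;
                    // a match ending at f also ends at v
                    nodes[v].out.insert(nodes[v].out.end(),
                                        nodes[f].out.begin(), nodes[f].out.end());
                    q.push(v);
                }
            }
        }

        // Scan `text`, collecting (end position, pattern id) per occurrence.
        std::vector<std::pair<size_t, int>> match(const std::string& text) const {
            std::vector<std::pair<size_t, int>> hits;
            int cur = 0;
            for (size_t i = 0; i < text.size(); ++i) {
                cur = nodes[cur].next[text[i] - 'a'];
                for (int id : nodes[cur].out) hits.push_back({i, id});
            }
            return hits;
        }
    };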

Micromega
  • Aho Corasick is for the inverse problem where you have multiple substrings that you want to find in one big one -- here there's one substring that needs to be found in multiple big ones. – Cameron Mar 27 '15 at 21:03
  • @Cameron: IMO you mean the KMP algorithm? Aho-Corasick is the more generalized form! – Micromega Mar 28 '15 at 13:26
  • No, neither of them. Both of those are for searching a single long string for matches of another (or others in the case of Aho-Corasick). As I understand the question, the goal is to search through many strings (in the dictionary) for matching substrings. As far as I know that cannot be done with Aho-Corasick. Perhaps I'm wrong -- in that case, could you expand your answer? – Cameron Mar 28 '15 at 15:24
  • @Cameron: Maybe you can use a wildcard with the Aho-Corasick algorithm? Split the search pattern at the wildcard!? – Micromega Mar 28 '15 at 15:53