As I pointed out in a comment, the prefix and suffix cases are covered by the general substring case (#2). All prefixes and suffixes are by definition substrings as well. So all we have to solve is the general substring problem.
Since you have a static dictionary, you can preprocess it relatively easily into a form that is fast to query for substrings. You could do this with a suffix tree, but it's far easier to construct and deal with simple sorted flat vectors of data, so that's what I'll describe here.
The end goal, then, is to have a list of sub-words that are sorted so that a binary search can be done to find a match.
First, observe that in order to find the longest substrings that match the query fragment it is not necessary to list all possible substrings of each word, but merely all possible suffixes; this is because all substrings can be merely thought of as prefixes of suffixes. (Got that? It's a little mind-bending the first time you encounter it, but simple in the end and very useful.)
So, if you generate all the suffixes of each dictionary word, then sort them all, you have enough to find any specific substring in any of the dictionary words: Do a binary search on the suffixes to find the lower bound (std::lower_bound) -- the first suffix that starts with the query fragment. Then find the upper bound (std::upper_bound) -- this will be one past the last suffix that starts with the query fragment. For both searches, compare only as many leading characters of each suffix as the query is long; with a plain full-string comparison the upper bound would stop right after an exact match and miss the longer suffixes that begin with the query. All of the suffixes in the range [lower, upper) must start with the query fragment, and therefore all of the words that those suffixes originally came from contain the query fragment.
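Here is a minimal sketch of that search over suffixes that are still spelled out in full as strings. The function names all_sorted_suffixes and prefix_range are made up for illustration, and the comparators look only at the first query.size() characters of each suffix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Every suffix of every word, spelled out and sorted (memory-hungry on purpose).
std::vector<std::string> all_sorted_suffixes(const std::vector<std::string>& words) {
    std::vector<std::string> suffixes;
    for (const std::string& w : words)
        for (std::size_t i = 0; i < w.size(); ++i)
            suffixes.push_back(w.substr(i));
    std::sort(suffixes.begin(), suffixes.end());
    return suffixes;
}

// Half-open range [lower, upper) of positions whose suffix starts with `query`.
std::pair<std::size_t, std::size_t>
prefix_range(const std::vector<std::string>& suffixes, const std::string& query) {
    // Precondition check (linear time, compiled out when NDEBUG is defined).
    assert(std::is_sorted(suffixes.begin(), suffixes.end()));

    // Both comparators look only at the first query.size() characters.
    auto lower = std::lower_bound(suffixes.begin(), suffixes.end(), query,
        [](const std::string& s, const std::string& q) {
            return s.compare(0, q.size(), q) < 0;   // prefix of s < q
        });
    auto upper = std::upper_bound(lower, suffixes.end(), query,
        [](const std::string& q, const std::string& s) {
            return s.compare(0, q.size(), q) > 0;   // q < prefix of s
        });
    return {static_cast<std::size_t>(lower - suffixes.begin()),
            static_cast<std::size_t>(upper - suffixes.begin())};
}
```

If the two returned positions are equal, no suffix starts with the query.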
Now, obviously actually spelling out all the suffixes would take an awful lot of memory -- but you don't need to. A suffix can be thought of as merely an index into a word -- the offset at which the suffix begins. So only a single pair of integers is required for each possible suffix: one for the (original) word index, and one for the index of the suffix in that word. (You can pack these two together cleverly depending on the size of your dictionary for even greater space savings.)
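As one possible way to do the packing, here is a sketch that squeezes both numbers into a single 32-bit integer. The 24/8 bit split is purely an assumption for illustration (fewer than 2^24 words, no word longer than 255 characters), and the names PackedSuffix, pack_suffix, word_of and offset_of are hypothetical; pick a split that fits your own dictionary.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: high 24 bits = word index, low 8 bits = suffix offset.
using PackedSuffix = std::uint32_t;

constexpr unsigned kOffsetBits = 8;
constexpr std::uint32_t kOffsetMask = (1u << kOffsetBits) - 1;

inline PackedSuffix pack_suffix(std::uint32_t word, std::uint32_t offset) {
    assert(word < (1u << (32 - kOffsetBits)));  // word index must fit in 24 bits
    assert(offset <= kOffsetMask);              // offset must fit in 8 bits
    return (word << kOffsetBits) | offset;
}

inline std::uint32_t word_of(PackedSuffix s)   { return s >> kOffsetBits; }
inline std::uint32_t offset_of(PackedSuffix s) { return s & kOffsetMask; }
```

Note that the packed value is only a compact reference; sorting still has to compare the suffix text it points to, not the integer itself.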
To sum up, all you need to do is:
- Generate an array of all the word-suffix index pairs for all the words.
- Sort these by the text of the suffixes they refer to (not by their numerical value). I suggest std::stable_sort with a custom comparator. This is the slowest step, but it can be done once, offline, since your dictionary is static.
- For a given query fragment, find the lower and upper bounds in the sorted suffix indices. Each suffix in this range corresponds to a matching substring (of the length of the query, starting at the suffix index in the word at the word index). Note that some words may match more than once, and that matches may even overlap.
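The three steps above can be sketched roughly as follows. This is only one way to write it: it assumes C++17 (for std::string_view), and SuffixRef, suffix_text, build_suffix_index and find_fragment are names I made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// One suffix = (which word, where the suffix starts in it).
struct SuffixRef {
    std::size_t word;
    std::size_t offset;
};

// The text a SuffixRef stands for, as a view (no characters are copied).
std::string_view suffix_text(const std::vector<std::string>& dict, const SuffixRef& s) {
    assert(s.word < dict.size() && s.offset <= dict[s.word].size());
    return std::string_view(dict[s.word]).substr(s.offset);
}

// Steps 1 and 2: generate every (word, offset) pair, then sort by suffix text.
std::vector<SuffixRef> build_suffix_index(const std::vector<std::string>& dict) {
    std::vector<SuffixRef> index;
    for (std::size_t w = 0; w < dict.size(); ++w)
        for (std::size_t o = 0; o < dict[w].size(); ++o)
            index.push_back(SuffixRef{w, o});
    std::stable_sort(index.begin(), index.end(),
        [&dict](const SuffixRef& a, const SuffixRef& b) {
            return suffix_text(dict, a) < suffix_text(dict, b);
        });
    return index;
}

// Step 3: half-open range [lower, upper) of index positions whose suffix
// starts with `query`. Only the first query.size() characters are compared.
std::pair<std::size_t, std::size_t>
find_fragment(const std::vector<std::string>& dict,
              const std::vector<SuffixRef>& index,
              std::string_view query) {
    auto lower = std::lower_bound(index.begin(), index.end(), query,
        [&dict](const SuffixRef& s, std::string_view q) {
            return suffix_text(dict, s).substr(0, q.size()) < q;
        });
    auto upper = std::upper_bound(lower, index.end(), query,
        [&dict](std::string_view q, const SuffixRef& s) {
            return q < suffix_text(dict, s).substr(0, q.size());
        });
    return {static_cast<std::size_t>(lower - index.begin()),
            static_cast<std::size_t>(upper - index.begin())};
}
```

The dictionary has to stay alive and unmodified for as long as the index is used, since the index only stores positions into it.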
To clarify, here's a minuscule example for a dictionary composed of the words "skunk" and "cheese".
The suffixes for "skunk" are "skunk", "kunk", "unk", "nk", and "k". Expressed as indices, they are 0, 1, 2, 3, 4. The suffixes for "cheese" are "cheese", "heese", "eese", "ese", "se", and "e". The indices are 0, 1, 2, 3, 4, 5.
Since "skunk" is the first word in our very limited imaginary dictionary, we'll assign it index 0. "cheese" is at index 1. So the final suffixes are: 0:0, 0:1, 0:2, 0:3, 0:4, 1:0, 1:1, 1:2, 1:3, 1:4, 1:5.
Sorting these suffixes yields the following suffix dictionary (I added the actual corresponding textual substrings for illustration only):
0 | 1:0 | cheese
1 | 1:5 | e
2 | 1:2 | eese
3 | 1:3 | ese
4 | 1:1 | heese
5 | 0:4 | k
6 | 0:1 | kunk
7 | 0:3 | nk
8 | 1:4 | se
9 | 0:0 | skunk
10 | 0:2 | unk
Consider the query fragment "e". Because the query is one character long, only the first character of each suffix takes part in the comparison. The lower bound is 1, since "e" is the first suffix whose first character is greater than or equal to "e". The upper bound is 4, since 4 ("heese") is the first suffix whose first character is greater than "e". So the suffixes at 1, 2, and 3 all start with the query, and therefore the words they came from all contain the query as a substring (at the suffix index, for the length of the query). In this case, all three of these suffixes belong to "cheese", at different offsets.
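Since several suffixes in the range can belong to the same word, you will often want to collapse the range to distinct words. Here is a small sketch of that; matching_words and SuffixRef are illustrative names, and the word indices follow the assignment made earlier ("skunk" is 0, "cheese" is 1).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One suffix = (which word, where the suffix starts in it).
struct SuffixRef {
    std::size_t word;
    std::size_t offset;
};

// Distinct word indices referenced by positions [lower, upper) of the
// sorted suffix index, returned in ascending order.
std::vector<std::size_t> matching_words(const std::vector<SuffixRef>& sorted_index,
                                        std::size_t lower, std::size_t upper) {
    assert(lower <= upper && upper <= sorted_index.size());
    std::vector<std::size_t> words;
    for (std::size_t i = lower; i < upper; ++i)
        words.push_back(sorted_index[i].word);
    // Drop repeats: a word shows up once per matching offset.
    std::sort(words.begin(), words.end());
    words.erase(std::unique(words.begin(), words.end()), words.end());
    return words;
}
```

For the query "e" in the example, the range [1, 4) collapses to the single word index of "cheese".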
Note that for a query fragment that's not a substring of any of the words (e.g. "a" in this example), there are no matches; in such a case, the lower and upper bounds will be equal.