What algorithms are used for searching subwords in words?

Question

I know this question may be stupid, but I am stuck and I need real help. In my project I need an algorithm that can find all words that start with a given word and return the tails of those words. For example: find all words that start with: dad

In dictionary we have:

dada, dadaism, daddled, daddling

Result:

a, aism, dled, dling

I have a dictionary with all the words, so all I need is the algorithm. Someone suggested that I use the Patricia algorithm, but I couldn't find any sample for C#. My dictionary is very big, so I also need a very fast algorithm.

More information:

Dictionary is sorted.

score 3 · Answer 1 · answered Dec 10 '11 at 12:47

This sounds like a perfect use for a trie or DAWG (directed acyclic word graph). I understand that a Patricia tree is a variation of a trie. There's a nice article about tries and an implementation here.

answered Dec 10 '11 at 12:47

spender

Ohh, I forgot tries :)), but building a trie when a list is already available is not good. Also, if we are only interested in StartsWith or EndsWith (as the question said) and we are allowed to build a trie, it's better to create a sorted list. – Saeed Amiri Dec 10 '11 at 12:49
@Eric Lippert discusses tries a little in this answer: http://stackoverflow.com/questions/8326446/how-do-i-quickly-find-the-longest-matching-string-in-c-net/8329927#8329927 – spender Dec 10 '11 at 12:54
I know what a trie is :) I implemented one around 8 years ago :) but it really wasn't fast enough; maybe my implementation had a problem. Anyway, the question here asks for finding all items that match a particular pattern (e.g. StartsWith), not for finding an exact string, and the pattern may not be limited to StartsWith. In that case, I can't see how a trie is useful. – Saeed Amiri Dec 10 '11 at 13:03

score 3 · Accepted Answer · answered Dec 10 '11 at 15:19

How you make this work will depend on how your dictionary is arranged. If it's a sorted list of words, then you can use binary search to find the first word that starts with "dad", and then loop through just those using StartsWith and Substring. That is:

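using System;
using System.Collections.Generic;

List<string> Words = LoadWords(); // however you load them
// Sort ordinally so that the sort, the search and StartsWith all agree on the ordering.
Words.Sort(StringComparer.Ordinal);

// Now, search for "dad" (or whatever)
string prefix = "dad";

int index = Words.BinarySearch(prefix, StringComparer.Ordinal);

// If the returned index is negative, the word wasn't found.
// The index is the bitwise complement of the place where it would be in the list.
if (index < 0)
{
    index = ~index;
}

// Walk forward from that position while the words still start with the prefix.
for (int i = index; i < Words.Count && Words[i].StartsWith(prefix, StringComparison.Ordinal); ++i)
{
    Console.WriteLine(Words[i].Substring(prefix.Length));
}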

This should be very fast. The sort is a one-time cost after loading. And you can eliminate it altogether if you store the dictionary in sorted order. The binary search is O(log n), where n is the number of words in the dictionary.

If your dictionary is unordered, then you'll have to go through all the words, which is going to take a lot of time.

There are other organizations for your dictionary, that will make it take a lot less space and that could potentially be faster. Those are somewhat more complicated and take a lot more time to build than creating a sorted list.

+1 love how the simplest datastructure provides the (imo) best solution. — Nicolas78, Dec 10 '11 at 15:22
@Nicolas78: That's often the case when requirements are simple. Things change quite a bit when the requirements become more complicated. Imagine, for example, if you wanted to find all words that contain the substring "dad". — Jim Mischel, Dec 10 '11 at 15:43

score 1 · Answer 3 · edited Dec 29 '11 at 20:24

The most famous one that I know is the Knuth–Morris–Pratt string matching algorithm.

If you take a look at the link, there are some others, like the Boyer–Moore string search algorithm. These are general algorithms, but if you are interested in special cases such as "starts with", most languages already provide them. For example, in C# you can use StartsWith and EndsWith; there is no need to implement them again.

edited Dec 29 '11 at 20:24

TonySalimi

answered Dec 10 '11 at 12:40

Saeed Amiri

@Saaed KMP is best suited to searching a large single string rather than a list of strings. – spender Dec 10 '11 at 12:41
Yeah, KMP is for pattern search, but in this case if we have a sorted list we can use this pattern search and it works fast, and if we don't have sorted list again this can help for faster search, but I don't know any other way, trie as other offered is for searching complete word not for special pattern. – Saeed Amiri Dec 10 '11 at 18:35

score 1 · Answer 4 · answered Dec 10 '11 at 20:07

For loops may be quite fast for very small dictionaries. But if your matching sets contain thousands of words, it will be very slow. Assuming that your dictionary is sorted (if not, sort it), you can use the BinarySearch function to locate both the first and the last items of the range and then use a for loop to build your results.

To be more practical, I have a (sorted) dictionary with 354984 words, including these 35 words starting with dad: dad, dad's, dada, dadaism, dadaisms, dadaist, dadaistic, dadaistically, dadaists, dadap, dadas, dadburned, dadder, daddies, dadding, daddle, daddled, daddles, daddling, daddock, daddocky, daddums, daddy, daddynut, dade, dadenhudd, dading, dado, dadoed, dadoes, dadoing, dados, dadouchos, dads and daduchus. If I follow Jim's approach, I will have to perform 35 "StartsWith" calls, which is OK. For the "sat" prefix I have 228 words, and for the "cat" prefix I have 692 words. For the size of my dictionary I need a total of 40 string comparisons (worst case) to locate the first and the last items.
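As a rough sketch of the two-binary-search idea above (not the answerer's actual code): the hypothetical `PrefixRange.Tails` helper below assumes the list was sorted with `StringComparer.Ordinal`, locates the start and the end of the matching range with one binary search each, and only then copies the tails. The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixRange
{
    // Returns the first index whose word satisfies the predicate, or words.Count
    // if none does. The predicate must be false for a leading run of the list
    // and true for everything after it.
    private static int FirstIndexWhere(IReadOnlyList<string> words, Func<string, bool> predicate)
    {
        int lo = 0, hi = words.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (predicate(words[mid])) hi = mid; else lo = mid + 1;
        }
        return lo;
    }

    // Returns the tails of all words that start with prefix.
    // Assumption: words is sorted with StringComparer.Ordinal.
    public static List<string> Tails(IReadOnlyList<string> words, string prefix)
    {
        // Start of the range: first word that is >= prefix.
        int start = FirstIndexWhere(words,
            w => string.CompareOrdinal(w, prefix) >= 0);

        // End of the range: first word that is >= prefix but no longer starts with it.
        int end = FirstIndexWhere(words,
            w => string.CompareOrdinal(w, prefix) >= 0
                 && !w.StartsWith(prefix, StringComparison.Ordinal));

        var tails = new List<string>(end - start);
        for (int i = start; i < end; i++)
        {
            tails.Add(words[i].Substring(prefix.Length));
        }
        return tails;
    }
}
```

With the question's sample words (dada, dadaism, daddled, daddling) and the prefix dad, this yields a, aism, dled, dling. The number of matches is known as `end - start` before any string is copied, which is the main gain over checking StartsWith on every step of the loop.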

If you are willing to use a trie implementation, make sure that it supports at least digits and dashes if your dictionary includes entries like 1st or real-time.

bragboy · Answer 5 · 2016-10-25T11:57:05.333

You can use a TRIE for this. You can find a comprehensive implementation and a tutorial here.

Basically, in this structure, you traverse from the root to 'd', then to 'a', then to 'd'. You reach a point where you want all the words that start with 'dad'. Treating this node as the new root, all you have to do is explore all the possible paths underneath it, and there is your algorithm.
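A minimal sketch of that traversal, not taken from the linked tutorial: the `PrefixTrie` class and its member names are hypothetical, children are stored per character (so digits and dashes work too), and the `out Node next` declaration assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class PrefixTrie
{
    private sealed class Node
    {
        // SortedDictionary keeps children in character order, so tails come out sorted.
        public readonly SortedDictionary<char, Node> Children = new SortedDictionary<char, Node>();
        public bool IsWord; // true if a dictionary word ends at this node
    }

    private readonly Node root = new Node();

    public void Add(string word)
    {
        Node node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out Node next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
        node.IsWord = true;
    }

    // Walks down the prefix, then collects every path below that node.
    public List<string> Tails(string prefix)
    {
        var tails = new List<string>();
        Node node = root;
        foreach (char c in prefix)
        {
            if (!node.Children.TryGetValue(c, out node))
            {
                return tails; // no word starts with this prefix
            }
        }
        Collect(node, new StringBuilder(), tails);
        return tails;
    }

    // Depth-first walk; tail holds the characters on the path from the prefix node.
    private static void Collect(Node node, StringBuilder tail, List<string> tails)
    {
        if (node.IsWord) tails.Add(tail.ToString());
        foreach (KeyValuePair<char, Node> child in node.Children)
        {
            tail.Append(child.Key);
            Collect(child.Value, tail, tails);
            tail.Length--; // backtrack before trying the next child
        }
    }
}
```

After adding dada, dadaism, daddled and daddling, `Tails("dad")` returns a, aism, dled, dling. One dictionary per node is simple but memory-hungry; whether this beats a sorted list in practice would need measuring on the real word list.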

edited Oct 25 '16 at 11:57

answered Dec 10 '11 at 12:50

bragboy

During my BS I also implemented a trie, but it really wasn't fast enough in practice. Is your trie fast enough? Did you benchmark it? I implemented a trie in C#, and it was only fast when searching a very, very big document; for data smaller than 2 MB it wasn't fast. – Saeed Amiri Dec 10 '11 at 12:57
@SaeedAmiri: I did not benchmark it, but you can see a sample UI in the link, and it retrieves words very fast. The only overhead of a trie, of course, is building it. After that, it will be extremely fast, no doubt about that. The Big O speaks for itself in the practical case of a trie. – bragboy Dec 10 '11 at 13:10

score 0 · Answer 6 · answered Dec 10 '11 at 12:55

If you need something simple, you can try this:

        string[] dict = new string[] { "dada", "dadaism", "daddled", "daddling" };
        string prefix = "dad";

        var words = from d in dict
                    where d.StartsWith(prefix)
                    select d.Substring(prefix.Length);

answered Dec 10 '11 at 12:55

wolfovercats

Simple for loop is faster than this, I bet :) – Saeed Amiri Dec 10 '11 at 13:05
