
I have a dictionary (approx 200k words) which is in alphabetical order. I only need to know if a string of letters can still become a word after a character has been inserted anywhere in the string. In other words: I need to know which characters inserted at a specific position in the word would still make it possible to later make it a word with insertions later on. The order (as in before or after each other) of letters in the string should be maintained.

I just don't think this is possible with decent performance (under half a second) without using some immense data structure or giving up on accuracy. But even if I compromise on accuracy, I don't know of any good method that would give me good accuracy with very high precision (almost all insertions reported as possible really are possible) while staying somewhat balanced. I wonder if other people see a way. Here is what I think is necessary:

  • Visit every word in the dictionary for 100% accuracy and precision, or make a compromise between speed and accuracy & precision.
  • Check if the word you visit has the characters in the right order to match with the string.
  • Check if an extra letter fits in this word

Does anyone know how to get a good combination of speed and accuracy? Right now I have a data structure that can find out very fast whether something is a word, so as a last resort I thought of just querying the database with random letters at random positions and asking the data structure whether the result can still become a word afterwards, but this feels like an unbalanced way of doing it, with no constant running time.

Joop

1 Answer


The way this question is phrased, it's almost like you are suggesting that the answer should include a probabilistic solution such as a Bloom filter, if not in so many words.

However, I think that a deterministic solution is feasible enough given the requirements (less than 0.5 seconds, reasonable memory usage) that it's worth trying to implement and optimize that rather than settling for an imperfect probabilistic solution.

Suppose you have a string of n characters and you want to find every single-character insertion that produces a string which is either a valid word or can be turned into one with further insertions. There are n+1 possible insertion positions and 26 possible characters at each position (assuming unaccented English letters), so a 9-character string has 10 × 26 = 260 possible insertions. Each of these has to be checked against the dictionary. With 200K entries, that translates to up to 52 million tests, each one asking "does this string of characters occur in this dictionary entry, in this order?". That seems achievable on a modern desktop or smartphone if we can find a way to "early out" of most tests.
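
To make the arithmetic concrete, here is a small Python sketch that enumerates the candidate insertions. The function name `candidate_insertions` and the sample string are my own choices for illustration, and the 200,000 figure is just the dictionary size from the question.

```python
from string import ascii_lowercase

def candidate_insertions(current):
    """Yield (position, letter, new_string) for every single-letter insertion."""
    for pos in range(len(current) + 1):
        for letter in ascii_lowercase:
            yield pos, letter, current[:pos] + letter + current[pos:]

candidates = list(candidate_insertions("pencilbox"))  # a 9-letter string
worst_case_tests = len(candidates) * 200_000          # one test per dictionary entry

# Inserting a letter directly before or after an identical letter produces
# the same string twice, so the number of distinct strings is a bit smaller.
distinct_strings = {new_string for _, _, new_string in candidates}

print(len(candidates))        # 260
print(worst_case_tests)       # 52000000
print(len(distinct_strings))  # 251
```

Since several (position, letter) pairs can lead to the same string, caching the dictionary result per distinct string saves a few scans, although the saving is small.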

In pseudo-code, the basic algorithm is:

List findPossibleInsertions(String currentString)
{
    List list = {};
    for(int pos = 0; pos < currentString.length + 1; pos++)
    {
        for(char c = 'a'; c <= 'z'; c++)
        {
            String insertedString = insert c into currentString before pos;
            if(stringIsImpossible(insertedString))
                continue;   // high level test whether the string could be turned into a valid word

            int64 stringMask = computeStringMask(insertedString);

            // the string is not impossible according to the test, but we need to verify that it is actually possible:
            for(String s in Dictionary)
            {
                // check if the string could be turned into s via insertions using a simple mask check to potentially exclude it (but not 100% confirm it):
                if((s.mask & stringMask) != stringMask)
                    continue;   // it's not possible to turn insertedString into s via insertions

                if(s.length < insertedString.length)
                    continue; // can't insert chars to make a shorter string

                // confirm that is it possible:
                if(canMakeStringViaInsertions(insertedString, s))
                {
                    list.add(insertedString); // this is a valid insertion, add to the list
                    break;
                }
            }
        }
    }
    return list;
}

That leaves us with 3 tasks:

  • Find a high level check that a given string cannot possibly be used to create any valid word with further insertions
  • Compute a mask that can be used to test if a given string can possibly be extended to create a given word via insertion, allowing false positives but no false negatives
  • Test definitively whether a given string can be extended to create a given word via insertions, allowing no false positives or negatives

For the first task, we can use precomputed bitmasks to store whether certain sequences of characters can occur in valid words (with the possibility of extra characters being added between any of them). To store sequences of 5 characters, we need 26*26*26*26*26 = 11881376 bits, or 1485172 bytes. Given that this will be roughly equal to the amount of storage needed to store 200K words (given an average word length of 5.1 characters, plus a terminating null, plus a 4-byte offset for each word), I don't think this counts as "enormous".

Store a bitfield for each combination of 3 chars, 4 chars and 5 chars.

Set the bitfields to all zeros, then do a pass through the dictionary. Taking the 5-char case as an example: for each word, take every possible 5-char sequence in which each char occurs later in the word than the previous char of the sequence. For example, the word "pencil" gives the following 5-char sequences:

"encil"
"pncil"
"pecil"
"penil"
"pencl"
"penci"

Add each of these 5-char combinations to the bitfield using this formula:

// note: ^ denotes exponentiation here (26^4 = 456976), not bitwise XOR
index = ((s[0]-'a')*(26^4)) + ((s[1]-'a')*(26^3)) + ((s[2]-'a')*(26^2)) + ((s[3]-'a')*26) + (s[4]-'a');
bitfield[index] = 1;

Once all the possible 5-char sequences from all the words in the dictionary have been added to the bitfield, any 5-char sequence that occurs in a string but does not have its bit set rules that string out: no entry in the dictionary contains those 5 chars in that order, so no matter what characters you add, no valid word will result.

The same process can be repeated for the 4-char and 3-char bitfields.
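
As a rough illustration of the precomputation, here is a Python sketch that packs the sequences into a bytearray. It assumes all words are lowercase a-z; the names `subsequence_index`, `build_bitfield` and `bit_is_set` are made up for this example, and the one-word dictionary only stands in for the real 200K-word list.

```python
from itertools import combinations

def subsequence_index(chars):
    """Base-26 index of a lowercase sequence, matching the index formula above."""
    index = 0
    for ch in chars:
        index = index * 26 + (ord(ch) - ord('a'))
    return index

def build_bitfield(words, k):
    """Bitfield of 26**k bits; a bit is set if that k-char sequence
    occurs (in order, gaps allowed) in at least one word."""
    bits = bytearray((26 ** k + 7) // 8)
    for word in words:
        # combinations() keeps the original order of the letters
        for combo in set(combinations(word, k)):
            i = subsequence_index(combo)
            bits[i >> 3] |= 1 << (i & 7)
    return bits

def bit_is_set(bits, chars):
    """Look up a single k-char sequence in a bitfield."""
    i = subsequence_index(chars)
    return bool(bits[i >> 3] & (1 << (i & 7)))

five = build_bitfield(["pencil"], 5)
pencil_sequences = sorted("".join(c) for c in set(combinations("pencil", 5)))
print(len(five))          # 1485172 bytes
print(pencil_sequences)   # the six sequences listed above, sorted
```

Long words produce many combinations (a 15-letter word has 3003 sequences of 5 chars), so the build pass is best done once and saved alongside the dictionary.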

To check if a string can possibly be extended to a valid word in the dictionary using bitfields, use a function like this:

boolean stringIsImpossible(String s)
{
    // note: ^ denotes exponentiation in the index formulas below, not bitwise XOR
    // test each run of 5 adjacent chars against the 5 char bitfield:
    for(int i = 0; i <= s.length - 5; i++)
    {
        int index = ((s[i]-'a')*(26^4)) + ((s[i+1]-'a')*(26^3)) + ((s[i+2]-'a')*(26^2)) + ((s[i+3]-'a')*26) + (s[i+4]-'a');
        if(bitfield5[index] == 0)
            return true;
    }
    if(s.length > 4)
        return false;   // every run of 4 or 3 chars lies inside a run of 5 that passed
    // test against 4 char bitfield:
    for(int i = 0; i <= s.length - 4; i++)
    {
        int index = ((s[i]-'a')*(26^3)) + ((s[i+1]-'a')*(26^2)) + ((s[i+2]-'a')*26) + (s[i+3]-'a');
        if(bitfield4[index] == 0)
            return true;
    }
    if(s.length > 3)
        return false;
    // test against 3 char bitfield:
    for(int i = 0; i <= s.length - 3; i++)
    {
        int index = ((s[i]-'a')*(26^2)) + ((s[i+1]-'a')*26) + (s[i+2]-'a');
        if(bitfield3[index] == 0)
            return true;
    }
    return false;   // strings shorter than 3 chars cannot be ruled out here
}
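
The pseudo-code only looks at runs of adjacent characters. A possibly stronger variant, sketched below in Python, tests every k-char sequence of the string, gaps included, which is still safe because any sequence taken from a string that fits inside a word also fits inside that word. To stay short, this sketch swaps the packed bitfields for plain Python sets, which would use far more memory on a real dictionary; `build_table` and `string_is_impossible` are illustrative names.

```python
from itertools import combinations

def build_table(words, k):
    """Set of every length-k sequence that occurs, in order, in some word."""
    return {combo for word in words for combo in combinations(word, k)}

def string_is_impossible(s, tables):
    """True if s provably cannot be extended to any dictionary word.

    tables maps k to the table built for that k. Only the largest k that
    fits s is consulted, because passing it implies passing the smaller ones.
    """
    for k in sorted(tables, reverse=True):
        if len(s) >= k:
            return any(combo not in tables[k] for combo in combinations(s, k))
    return False  # s is shorter than every table, so nothing can be ruled out

words = ["pencil", "stencil"]
tables = {k: build_table(words, k) for k in (3, 4, 5)}

print(string_is_impossible("pncl", tables))    # False: fits inside "pencil"
print(string_is_impossible("pstncl", tables))  # True: no word has 'p' before 't'
```

A result of False only means the string was not ruled out; the definitive check against the dictionary is still needed.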

For the second task, we need a bitmask for each dictionary word that makes it cheap to test whether the word can possibly be created from the current string by adding letters. For that to be possible, the word needs to contain all the letters in the string, in the same order. Logically, if it doesn't contain all the letters in the string, then it can't contain them in the same order either. So we can create a bitmask by setting bit 0 if the word contains the letter 'a', bit 1 if it contains 'b', bit 2 if it contains 'c', etc. We then AND the bitmask of the string with the bitmask of the word we are trying to make by inserting chars into the string. If the result does not equal the bitmask of the string, then the word cannot be made from it, because not all the letters in the string are present in the dictionary word.

Furthermore, we can set extra bits in the mask based on whether certain letters appear after certain other letters. For example, we could set a bit if there is a letter 'g' in the string, and at some point after that there is a letter 't'. If the bit is set in the string but not in the target word then the word cannot be made from the string. It's also possible to reuse bits to handle more than one letter combination. For example, a bit could be set if there is a 'g' followed by a 't', OR there is a 'd' followed by a 'j', etc. The possibility of a collision causing a false match is reduced because a string with a 'g' followed by a 't' also has its individual 'g' and 't' bits set. When it is matched against a word that only has a 'd' followed by a 'j', there is a collision on the shared bit, but the word's individual 'g' and 't' bits would most likely not be set, so the word is still rejected. As long as there are no false negatives, some false positives are acceptable.

The function for computing the mask for a string would be something along these lines:

int64 computeStringMask(String s)
{
    int64 mask = 0;
    // add individual letters to bitmask (bits 0-25):
    for(int i = 0; i < s.length; i++)
    {
        mask |= (int64)1 << (s[i]-'a');
    }
    // add "followed by" letter combinations to bitmask (bits 26-62, shared between pairs):
    for(int i = 0; i < s.length-1; i++)
    {
        for(int j = i+1; j < s.length; j++)
        {
            // shift a 64-bit 1, since a 32-bit constant cannot reach bits above 31
            mask |= (int64)1 << (((((s[i]-'a') * 26) + (s[j]-'a')) % 37) + 26);
        }
    }
    return mask;
}

This mask would need to be computed and stored for each string in the dictionary.
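
For anyone who wants to experiment with the mask, here is a direct Python transcription, assuming lowercase a-z input. Python integers do not overflow, so the shift needs no special care here; in a language with fixed-width integers the 1 being shifted has to be a 64-bit value. The helper `may_be_subsequence` is a name I made up for the AND test.

```python
def compute_string_mask(s):
    """63-bit mask: bits 0-25 mark letters present, bits 26-62 are shared
    buckets for ordered letter pairs (first letter occurs before second)."""
    mask = 0
    for ch in s:
        mask |= 1 << (ord(ch) - ord('a'))
    for i in range(len(s) - 1):
        for j in range(i + 1, len(s)):
            pair = (ord(s[i]) - ord('a')) * 26 + (ord(s[j]) - ord('a'))
            mask |= 1 << (pair % 37 + 26)
    return mask

def may_be_subsequence(s_mask, word_mask):
    """False means the string certainly does not fit inside the word;
    True means it might, and the exact test is still required."""
    return (word_mask & s_mask) == s_mask

pencil = compute_string_mask("pencil")
print(may_be_subsequence(compute_string_mask("pncl"), pencil))  # True
print(may_be_subsequence(compute_string_mask("pnx"), pencil))   # False: no 'x'
print(may_be_subsequence(compute_string_mask("lp"), pencil))    # False: wrong order
```

The last case shows the pair bits at work: both letters are present in "pencil", but the bucket for 'l' followed by 'p' is not set by any pair in that word.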

Third task: to test if a given string can be extended to create a given word, it's just a matter of checking that the word contains each char in the string, in the correct order:

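boolean canMakeStringViaInsertions(String s, String word)
{
    // both strings are assumed to be null-terminated
    if(s[0] == 0)
        return true;   // an empty string can be extended to any word
    int i = 0, j = 0;
    while(word[j] != 0)
    {
        if(s[i] == word[j])
        {
            // match!
            i++;
            if(s[i] == 0)
                return true;   // all chars have matched
        }
        j++;
    }
    return false;
}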
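
The same check can be written very compactly in Python, which may be handy for testing an optimized implementation against a simple reference. This is only a sketch; it relies on the fact that `in` on an iterator consumes it up to and including the match, so each letter of `s` has to be found after the previous one.

```python
def can_make_string_via_insertions(s, word):
    """True if word can be produced from s by inserting characters,
    i.e. the letters of s occur in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in s)

print(can_make_string_via_insertions("pncl", "pencil"))  # True
print(can_make_string_via_insertions("lp", "pencil"))    # False
```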

A further optimization to the findPossibleInsertions() function is to divide the dictionary into blocks, compute the string mask for each word in a block, and OR them all together. If the mask computed from a string tests negative against the block mask, then none of the words in the block need to be tested.
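
Putting the pieces together, the block idea could look roughly like the Python sketch below. The class name `BlockedDictionary`, the default `block_size` of 256 and the tiny word list are assumptions for illustration only; the block size in particular would need tuning, since OR-ing many masks together quickly sets most bits and makes the block test less selective.

```python
from string import ascii_lowercase

def compute_mask(s):
    """Letter bits (0-25) plus shared ordered-pair bits (26-62)."""
    mask = 0
    for ch in s:
        mask |= 1 << (ord(ch) - ord('a'))
    for i in range(len(s) - 1):
        for j in range(i + 1, len(s)):
            pair = (ord(s[i]) - ord('a')) * 26 + (ord(s[j]) - ord('a'))
            mask |= 1 << (pair % 37 + 26)
    return mask

def is_subsequence(s, word):
    remaining = iter(word)
    return all(ch in remaining for ch in s)

class BlockedDictionary:
    def __init__(self, words, block_size=256):
        self.blocks = []  # list of (block_mask, [(word, word_mask), ...])
        for start in range(0, len(words), block_size):
            entries = [(w, compute_mask(w)) for w in words[start:start + block_size]]
            block_mask = 0
            for _, word_mask in entries:
                block_mask |= word_mask
            self.blocks.append((block_mask, entries))

    def can_extend(self, s):
        """True if s fits, in order, inside at least one dictionary word."""
        s_mask = compute_mask(s)
        for block_mask, entries in self.blocks:
            if (block_mask & s_mask) != s_mask:
                continue  # no word in this block can match, skip all of it
            for word, word_mask in entries:
                if ((word_mask & s_mask) == s_mask
                        and len(word) >= len(s)
                        and is_subsequence(s, word)):
                    return True
        return False

def find_possible_insertions(current, dictionary):
    """All (position, letter) pairs whose insertion keeps a word reachable."""
    results = []
    for pos in range(len(current) + 1):
        for letter in ascii_lowercase:
            if dictionary.can_extend(current[:pos] + letter + current[pos:]):
                results.append((pos, letter))
    return results

dictionary = BlockedDictionary(["pen", "pencil", "stencil"], block_size=2)
print(find_possible_insertions("pn", dictionary))
# [(1, 'e'), (2, 'c'), (2, 'i'), (2, 'l')]
```

The sketch leaves out the high-level impossibility test from the first task, which would go in front of `can_extend` as a further early-out.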

samgak
  • You're welcome, it's an interesting puzzle and it was fun to try and solve it. I've posted a followup question inspired by it here: http://stackoverflow.com/questions/30158878/what-is-the-optimal-way-to-choose-a-set-of-features-for-excluding-items-based-on Also, if and when you implement it I would be curious to know what the performance is like and what percentage of string tests are excluded by the early-out bitmask testing. – samgak May 11 '15 at 02:44