I recently came across a problem where the input is a string containing a whole sentence written as a single word, with no spaces, which makes it unreadable.

For example, "I want to leave" is written as Iwanttoleave.

The task is to separate out each of the tokens (words, numbers, abbreviations, etc.).

I have no idea where to start.

My first thought was to build a dictionary and match substrings against it, but I don't think building a dictionary is a good idea.

Can anyone suggest an algorithm for this?

Geek_To_Learn
  • I also don't know where to start. I suppose if your example `Iwanttoleave` is accurate, then you might have to use some sort of dictionary to fish out whole words from the mess. Are you certain there are no delimiters which might help you? – Tim Biegeleisen Dec 15 '15 at 10:16
  • This is very complicated and you have shown no effort (no code is posted) – Idos Dec 15 '15 at 10:20
  • @Idos : I have spent a considerable amount of time on this, and I am not asking for code. Before writing code, I need to come up with an algorithm. – Geek_To_Learn Dec 15 '15 at 10:56
  • Algo/Code is essentially the same here. This task is way too broad for the scope here (and also opinion-based) anyway and should be closed. – Idos Dec 15 '15 at 10:57
  • @Idos As per your instructions I have edited the question. I hadn't written about the dictionary because I thought it was the first and too naive an idea to discuss – Geek_To_Learn Dec 15 '15 at 10:59
  • @TimBiegeleisen The only information I was given is that the string contains alphanumeric characters. – Geek_To_Learn Dec 15 '15 at 11:02
  • All that is certain, you will *have* to have a dictionary with *all* the English words to accomplish this correctly, which is odd. – Idos Dec 15 '15 at 11:05

2 Answers

Instead of using a Dictionary, I'd suggest a Trie containing all your valid words (the whole English dictionary?). You then advance one letter at a time through the input line and through the trie simultaneously. If the letter leads to a child node in the trie, you can keep extending the current word; if not, you start looking for a new word from the root of the trie.

This won't be a forward-only search, so you'll need some form of backtracking.

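using System.Collections.Generic;
using System.Diagnostics;   // Debug.Assert
using System.Linq;          // FirstOrDefault

// This method generates a list of all the matching phrases for the given input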
List<string> CandidatePhrases(string input) {
    Trie validWords = BuildTheTrieWithAllValidWords();
    List<string> currentWords = new List<string>();
    List<string> possiblePhrases = new List<string>();
    // The root of the trie has an empty key that points to all the first letters of all words
    Trie currentWord = validWords;
    int currentLetter = -1;
    // Calls a backtracking method that creates all possible phrases
    FindPossiblePhrases(input, validWords, currentWords, currentWord, currentLetter, possiblePhrases);

    return possiblePhrases;
}

// The Trie structure could be something like
class Trie {
    char key;
    bool valid;
    List<Trie> children;
    Trie parent;

    Trie Next(char nextLetter) {
        return children.FirstOrDefault(c => c.key == nextLetter);
    }

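    // Rebuilds the word by walking parent links from this node up to the root (whose key is '\0')
    string WholeWord() {
        Debug.Assert(valid);
        string word = "";
        Trie current = this;
        while (current.key != '\0')
        {
            word = current.key + word;
            current = current.parent;
        }
        return word;
    }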
}

void FindPossiblePhrases(string input, Trie validWords, List<string> currentWords, Trie currentWord, int currentLetter, List<string> possiblePhrases) {
    if (currentLetter == input.Length - 1) {
        if (currentWord.valid) {
            string phrase = "";
            foreach (string word in currentWords) {
                phrase += word;
                phrase += " ";
            }
            phrase += currentWord.WholeWord();
            possiblePhrases.Add(phrase);
        }
    }
    else {
        // The currentWord may be a valid word. If that's the case, the next letter could be the first of a new word, or could be the next letter of a bigger word that begins with currentWord
        if (currentWord.valid) {
            // Try to match phrases when the currentWord is a valid word
            currentWords.Add(currentWord.WholeWord());
            FindPossiblePhrases(input, validWords, currentWords, validWords, currentLetter, possiblePhrases);
            currentWords.RemoveAt(currentWords.Count - 1);
        }

        // Whether or not currentWord is a valid word, try to match a longer word that begins with currentWord
        int nextLetter = currentLetter + 1;
        Trie nextWord = currentWord.Next(input[nextLetter]);
        // If the nextWord is null, there was no matching word that begins with currentWord and has input[nextLetter] as the following letter.
        if (nextWord != null) {
            FindPossiblePhrases(input, validWords, currentWords, nextWord, nextLetter, possiblePhrases);
        }
    }
}    
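
For comparison, here is a minimal Java sketch of the same trie-plus-backtracking idea. It is an illustration under stated assumptions, not a port of the code above: the names `TrieSegmenter`, `Node`, `insert`, and `segment` are hypothetical, children are stored in a `HashMap` instead of a list, and the recursion restarts from the root at each word boundary rather than passing the current trie node between calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical trie-based segmenter; all names here are illustrative. */
final class TrieSegmenter {
    /** One trie node: children keyed by letter, plus an end-of-word flag. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    /** Inserts a word, creating nodes along its path as needed. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns every way to split the input into words stored in the trie. */
    List<String> segment(String input) {
        List<String> phrases = new ArrayList<>();
        backtrack(input, 0, new ArrayList<>(), phrases);
        return phrases;
    }

    private void backtrack(String input, int start, List<String> words, List<String> phrases) {
        if (start == input.length()) {
            if (!words.isEmpty()) {
                phrases.add(String.join(" ", words));
            }
            return;
        }
        Node node = root;
        // Walk the trie one letter at a time; stop as soon as no word can continue.
        for (int end = start; end < input.length(); end++) {
            node = node.children.get(input.charAt(end));
            if (node == null) {
                return;
            }
            if (node.isWord) {
                words.add(input.substring(start, end + 1));
                backtrack(input, end + 1, words, phrases);
                words.remove(words.size() - 1); // undo the choice, then try a longer word
            }
        }
    }

    public static void main(String[] args) {
        TrieSegmenter segmenter = new TrieSegmenter();
        for (String w : new String[] {"i", "want", "to", "leave"}) {
            segmenter.insert(w);
        }
        System.out.println(segmenter.segment("iwanttoleave"));
    }
}
```

The early `return` when `node == null` is where the trie pays off: once no stored word starts with the letters read so far, longer substrings from that start position are never examined.
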
Fede
  • This is a very incomplete answer. Also, the trie as you describe it would only save some space. Almost all commonly used English words are less than 10 characters long, so the trie does not help with time complexity at all. At the same time, you did not mention how you would actually split the words. – vish4071 Dec 15 '15 at 12:05
  • The use of a Trie is not just to save space, but rather to allow searching for words one letter at a time. With a dictionary, you have to go splitting the tokens by guessing where words are, and then looking if the resulting substring is in the dictionary. With the trie you have to guess less. I'll update my answer with some code. – Fede Dec 15 '15 at 12:17
  • I said that *here*, trie would not be very efficient (in time) as the words to search are very small. Also, its implementation is not very easy and why put an effort that does not help much? – vish4071 Dec 15 '15 at 12:20
  • @vish4071 I edited my answer. I hope you find it less incomplete, and can check how I'm splitting the words. If the input string is long enough (say 50 characters), I'm willing to bet this will outperform the dictionary implementation. – Fede Dec 15 '15 at 13:08

First of all, create a dictionary that lets you check whether a given string is a valid word.

// `dictionary` is assumed to be a collection of valid words, e.g. a Set<String>
boolean isValidString(String s){
    return dictionary.contains(s);
}
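
The `dictionary` object is not defined in this answer. One way it might look, as a sketch only, is a `HashSet` wrapper that lowercases entries so that `Iwanttoleave` can match `i`. The class name `WordDictionary` and the lowercasing policy are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical word list backed by a hash set for constant-time average lookups. */
final class WordDictionary {
    private final Set<String> words = new HashSet<>();

    /** Stores every entry in lower case (an assumed normalization policy). */
    WordDictionary(Iterable<String> entries) {
        for (String entry : entries) {
            words.add(entry.toLowerCase(Locale.ROOT));
        }
    }

    /** Case-insensitive membership test. */
    boolean isValidString(String s) {
        return words.contains(s.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        WordDictionary dictionary = new WordDictionary(Arrays.asList("I", "want", "to", "leave"));
        System.out.println(dictionary.isValidString("Want")); // stored as "want"
        System.out.println(dictionary.isValidString("wan"));  // not a stored word
    }
}
```
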

Now you can write recursive code that splits the string and collects the words that are actually useful.

List<String> usefulWords = new ArrayList<>();      //global declaration
void split(String s){
    int l = s.length();
    for(int i = l-1; i >= 0; i--){
        //s.substring(i,l) returns the substring starting at index `i` and ending at `l-1`
        if(isValidString(s.substring(i,l))){
            usefulWords.add(s.substring(i,l));
            split(s.substring(0,i));
        }
    }
}

Now, use these usefulWords to generate all possible strings. Maybe something like this:

List<String>[] splits = new ArrayList[10];   //assuming max 10 possible outputs
void allPossibleStrings(String s, int level){
    //i is the exclusive end of the prefix, so it must reach s.length() to test the whole string
    for(int i = 1; i <= s.length(); i++){
        if(usefulWords.contains(s.substring(0,i))){
            if(splits[level] == null) splits[level] = new ArrayList<>();
            splits[level].add(s.substring(0,i));
            allPossibleStrings(s.substring(i),level);
            level++;
        }
    }
}

Now, this code gives you all possible splits in a somewhat arbitrary manner, e.g.

dictionary = {cat, dog, i, am, pro, gram, program, programmer, grammer}

input:
string = program
output:
splits[0] = {pro, gram}
splits[1] = {program}

input:
string = iamprogram
output:
splits[0] = {i, am, pro, gram}   //since `mer` is not in dictionary
splits[1] = {program}

I did not give much thought to the last part, but I think you should be able to work out the code from there as per your requirement.

Also, since no language is tagged, I've taken the liberty of writing the code in Java-like syntax, as it is easy to understand.
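
As one possible way to finish that last part, the sketch below returns every complete phrase rather than separate word lists. It is not the code above: the class `WordBreak`, the method `allSplits`, and the memoization on the start index are assumptions introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical dictionary-based splitter that returns whole phrases. */
final class WordBreak {
    /** Returns all phrases formed by splitting s entirely into dictionary words. */
    static List<String> allSplits(String s, Set<String> dictionary) {
        if (s.isEmpty()) {
            return new ArrayList<>();
        }
        return splitFrom(s, 0, dictionary, new HashMap<>());
    }

    /** Splits the suffix s[start..]; results are cached per start index. */
    private static List<String> splitFrom(String s, int start, Set<String> dictionary,
                                          Map<Integer, List<String>> memo) {
        if (memo.containsKey(start)) {
            return memo.get(start);
        }
        List<String> result = new ArrayList<>();
        if (start == s.length()) {
            result.add(""); // exactly one way to split an empty remainder: no more words
        } else {
            for (int end = start + 1; end <= s.length(); end++) {
                String word = s.substring(start, end);
                if (!dictionary.contains(word)) {
                    continue;
                }
                // Prepend this word to every split of what follows it.
                for (String rest : splitFrom(s, end, dictionary, memo)) {
                    result.add(rest.isEmpty() ? word : word + " " + rest);
                }
            }
        }
        memo.put(start, result);
        return result;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("i", "am", "pro", "gram", "program"));
        System.out.println(allSplits("iamprogram", dictionary));
    }
}
```

Unlike the `splits` arrays in the example output, each result here carries the full phrase, so the `i am` prefix appears in both alternatives. The memo avoids re-splitting the same suffix, although the number of phrases can still grow exponentially for ambiguous inputs.
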

vish4071