My Markov chain text generator routine is incomplete: how do I approach a pos/word/freq data structure?

Question

I want to create a simple text generator with Markov chains. I don't understand how the Java 'random' routines are used, or which data structures to use.

For example, let's say I have a routine to load a document and then a Markov routine to generate a document based on that structure. How would I modify/create the generate document routine?

public class MarkovGenerator {

    DataStructure wordFreqMapByPos = new DataStructure();
    public void train(String doc) {
        for (word : doc) {
            // Add word to the pos,
            // Build a word frequency map AT THE POSITION IN THE DOCUMENT
            wordFreqMapByPos.put(thePos, wordsAtThisPos)
        }
    }

    public void generateDocument() {
       ?????
       for (pos : wordFreqMapByPos) {
            // Generate a word
            // Do I need to weight? generate a word based on how often
            // the word occurs at this position?
            // How?
       }
    }

}

Yes, you need to build a data structure that says how often each word is used in each position. And you use that information for the weighted random selection in the generator. Which part are you having trouble with? — Jim Mischel, Mar 28 '13 at 19:31
What is a position? Is it the position in a sentence or the position in the document? Or should I just use two words: here is a word, and here are the possible words that could come after it? — Berlin Brown, Mar 28 '13 at 19:36
Normally you would measure the probability of a word occurring after another. In my last sentence you would count tuples like: `<"Normally","you">`, `<"you","would">`, `<"would","measure">` etc. This is done over all the words you know, and then you calculate the probability of `<"Normally", ?>` where `?` stands for each word that may follow "Normally". You can see this in my implementation (I just replaced words with integers): https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/nlp/MarkovChain.java — Thomas Jungblut, Mar 28 '13 at 20:22
And a simple data structure to use for strings is basically a multimap: each word is mapped to the n other words that follow it, along with how often each pair co-occurs. — Thomas Jungblut, Mar 28 '13 at 20:25
Also, I did a short article series on using Markov models a few years ago. It's in C#, and it's character-based (rather than word-based), but you might find it relevant for the background: http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=745 — Jim Mischel, Mar 29 '13 at 14:22
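
To make the suggestions in the comments concrete, here is a minimal sketch of the word-pair approach: a map from each word to the counts of the words that follow it, plus a weighted random pick using `java.util.Random`. The names `MarkovSketch`, `train`, `nextWord` and `generate` are hypothetical, and splitting on whitespace is an assumption (a real tokenizer would also handle punctuation and case). It is not taken from either of the linked implementations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration class, not the asker's MarkovGenerator.
class MarkovSketch {

    // word -> (following word -> number of times it followed that word)
    private final Map<String, Map<String, Integer>> transitions = new HashMap<>();
    private final Random random;

    MarkovSketch(Random random) {
        this.random = random;
    }

    /** Counts every adjacent word pair in the document (whitespace tokenization assumed). */
    void train(String doc) {
        String[] words = doc.trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            transitions
                .computeIfAbsent(words[i], k -> new HashMap<>())
                .merge(words[i + 1], 1, Integer::sum);
        }
    }

    /** Picks a follower with probability proportional to its count; null if the word has none. */
    String nextWord(String word) {
        Map<String, Integer> followers = transitions.get(word);
        if (followers == null) {
            return null;
        }
        int total = 0;
        for (int count : followers.values()) {
            total += count;
        }
        // r is uniform in [0, total); walk the counts until r falls inside one bucket.
        int r = random.nextInt(total);
        for (Map.Entry<String, Integer> e : followers.entrySet()) {
            r -= e.getValue();
            if (r < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("unreachable: counts sum to total");
    }

    /** Walks the chain from a start word until maxWords is reached or a word has no follower. */
    String generate(String start, int maxWords) {
        List<String> out = new ArrayList<>();
        String current = start;
        while (current != null && out.size() < maxWords) {
            out.add(current);
            current = nextWord(current);
        }
        return String.join(" ", out);
    }
}
```

A possible use would be `MarkovSketch m = new MarkovSketch(new Random()); m.train(text); System.out.println(m.generate("the", 20));`. Passing the `Random` in lets you supply a fixed seed when you want repeatable output. The same weighted pick would also work for a per-position frequency map if you decide to keep the position-based design.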
