I'm an economist and I have recently started analysing qualitative and text data, which is new to me.
I want to build a Markov model for text prediction based on my corpus of interviews. I have analysed the corpus with the tm package and created a DocumentTermMatrix (and the equivalent TermDocumentMatrix) with bigrams (pairs of words). Now I want to compute the probability matrix for the word pairs so that I can use it for Markov chain prediction. So far I have tried this piece of code from http://www.salemmarafi.com/code/twitter-naive-bayes/
probabilityMatrix <- function(docMatrix)
{
  # Convert once to a dense matrix (documents in rows, terms in columns)
  m <- as.matrix(docMatrix)
  # Sum up the term frequencies
  termSums <- cbind(colnames(m), as.numeric(colSums(m)))
  # Add one (additive smoothing)
  termSums <- cbind(termSums, as.numeric(termSums[, 2]) + 1)
  # Calculate the probabilities
  termSums <- cbind(termSums, as.numeric(termSums[, 3]) / sum(as.numeric(termSums[, 3])))
  # Calculate the natural log of the probabilities
  termSums <- cbind(termSums, log(as.numeric(termSums[, 4])))
  # Add pretty names to the columns
  colnames(termSums) <- c("term", "count", "additive", "probability", "lnProbability")
  termSums
}
But I'm fairly sure this is not the right approach for my problem: the code computes the relative frequency of each pair, but it does not give the transition probability from one word to the next. I have also seen that there are implementations of text prediction algorithms in Python and in Java (see GitHub), but I'm not able to translate them to R. Does anyone have a piece of code that performs this kind of analysis in R, or know of a package that does it directly?
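To make concrete what I mean by transition probability, here is a rough base R sketch of the quantity I think I need: the count of each bigram divided by the total count of all bigrams that start with the same first word. The `bigramCounts` vector is made-up toy data standing in for the column sums of my bigram DocumentTermMatrix, and `predict_next` is just a hypothetical helper name; I am assuming each bigram term is two words separated by a single space. This may well not be the best way to do it.

```r
# Toy bigram counts (hypothetical data), standing in for
# colSums(as.matrix(docMatrix)) on a bigram DocumentTermMatrix
bigramCounts <- c("the cat" = 2, "the mat" = 1, "cat sat" = 1,
                  "cat ran" = 1, "sat on" = 1, "on the" = 1, "mat the" = 1)

# Split each bigram into its first and second word
# (assumes exactly one space between the two words)
parts <- strsplit(names(bigramCounts), " ", fixed = TRUE)
bigrams <- data.frame(
  first  = vapply(parts, `[`, character(1), 1),
  second = vapply(parts, `[`, character(1), 2),
  count  = unname(bigramCounts),
  stringsAsFactors = FALSE
)

# Count matrix: rows = current word, columns = next word
counts <- xtabs(count ~ first + second, data = bigrams)

# Normalise each row so it sums to 1: P(second | first)
transitions <- prop.table(counts, margin = 1)

# Hypothetical helper: most probable next word, or NA for an unseen word
predict_next <- function(word, transitions) {
  if (!word %in% rownames(transitions)) return(NA_character_)
  names(which.max(transitions[word, ]))
}

print(round(transitions, 2))
print(predict_next("the", transitions))
```

In this toy example "the" is followed by "cat" twice and by "mat" once, so the row for "the" should give 2/3 for "cat" and 1/3 for "mat". A dense matrix like this would presumably get very large for a real vocabulary, so a sparse representation might be needed.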
Thanks in advance
Jose