I'm an economist and now I'm analysing some qualitative and text data. This is new for me.

I want to create a Markov model for text prediction based on my interview corpus. I have analyzed the corpus with the tm package, and after creating a DocumentTermMatrix (or, equivalently, a TermDocumentMatrix) with bigrams (pairs of words), I want to compute the probability matrix for each pair of words so that I can use it for Markov chain prediction. So I tried this piece of code from http://www.salemmarafi.com/code/twitter-naive-bayes/

probabilityMatrix <- function(docMatrix)
{
  # Sum up the term frequencies (one column per term in the matrix)
  termSums <- cbind(colnames(as.matrix(docMatrix)), as.numeric(colSums(as.matrix(docMatrix))))
  # Add one (additive smoothing, so no term ends up with zero probability)
  termSums <- cbind(termSums, as.numeric(termSums[, 2]) + 1)
  # Calculate the probabilities
  termSums <- cbind(termSums, (as.numeric(termSums[, 3]) / sum(as.numeric(termSums[, 3]))))
  # Calculate the natural log of the probabilities
  termSums <- cbind(termSums, log(as.numeric(termSums[, 4])))
  # Add pretty names to the columns
  colnames(termSums) <- c("term", "count", "additive", "probability", "lnProbability")
  termSums
}

But I'm sure this is not the correct approach to my problem, because this code computes the frequency of each pair but does not consider the transition probability from one word to another. I have also seen some implementations of text prediction algorithms in Python and in Java (see GitHub), but I'm not able to translate them to R. Does anyone have a piece of code to perform this kind of analysis in R, or know of a package that does it directly?
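To make the goal concrete, here is a minimal base R sketch of what a first-order transition matrix could look like. It is only an illustration under stated assumptions: the function name `transitionMatrix` and the toy counts are hypothetical, and the input is assumed to be a named numeric vector of bigram counts whose names are exactly two words separated by a single space (for example, what `colSums(as.matrix(dtm))` would give for a bigram DocumentTermMatrix).

```r
# Hypothetical helper (not from tm or any package): turn bigram counts into
# a matrix of transition probabilities P(next word | current word).
# Assumes names(bigramCounts) look like "word1 word2".
transitionMatrix <- function(bigramCounts) {
  parts  <- strsplit(names(bigramCounts), " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)   # current word
  second <- vapply(parts, `[`, character(1), 2)   # next word

  # Cross-tabulate counts: rows = current word, columns = next word
  counts <- tapply(as.numeric(bigramCounts),
                   list(current = first, nxt = second), sum)
  counts[is.na(counts)] <- 0                      # unseen pairs get count 0

  # Divide each row by its total so every row sums to 1
  counts / rowSums(counts)
}

# Toy bigram counts, made up for illustration only
bigramCounts <- c("i want" = 3, "want to" = 2, "want a" = 1, "to go" = 2)
P <- transitionMatrix(bigramCounts)

# Most likely word to follow "want"
nextWord <- names(which.max(P["want", ]))
```

Each row of `P` is the conditional distribution of the next word given the current one, which is the quantity the frequency-based function above does not provide. No smoothing is applied here, so unseen transitions get probability zero.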

Thanks in advance

Jose

JosePerles