I want to build a transition matrix for a Markov chain model that will drive a recommender system. My data looks like this:

            Date    StudentID   Subjectid
            201601   123        1
            201601   234        4
            201601   122        2
            201602   123        3
            201602   123        1
            201602   234        2
            201603   122        3

I want to predict the next three subjects that a student is most likely to pick. I am finding it difficult to get this data into the form of a transition matrix so that I can build a Markov chain model.

I have tried the following code, but I am not sure whether it generates the transition matrix correctly. Please help!

              rf <- data$Subjectid
              n <- length(data$Subjectid)
              # Cross-tabulate each subject with the two subjects that follow it
              trf <- table(data.frame(data$Subjectid[1:(n-2)],
                                      data$Subjectid[2:(n-1)],
                                      data$Subjectid[3:n]))
              trf/rowSums(trf)
lmo
abhi

2 Answers

1

There is already a post that covers creating a transition matrix. First reshape your data so that it looks something like this, then apply the function:

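# Cross-tabulate students against subjects (rows: StudentID, columns: Subjectid, cells: counts)
df1 <- as.data.frame.matrix(table(data[,c("StudentID","Subjectid")]))
# Tabulate each column of X against the column to its right;
# with prob=TRUE, divide each row by its total to get proportions
trans.matrix <- function(X, prob=TRUE)
{
    tt <- table( c(X[,-ncol(X)]), c(X[,-1]) )
    if(prob) tt <- tt / rowSums(tt)
    tt
}
transition_df <- trans.matrix(as.matrix(df1))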
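
As a quick check of what `trans.matrix` computes, the sketch below runs it on a small made-up matrix. It assumes each row holds one student's subject sequence in time order, one column per time step; the `seqs` values are hypothetical and are not taken from the question's data.

```r
# Same function as above: tabulate every value against its right-hand neighbour
trans.matrix <- function(X, prob = TRUE) {
  tt <- table(c(X[, -ncol(X)]), c(X[, -1]))
  if (prob) tt <- tt / rowSums(tt)
  tt
}

# Hypothetical input: one row per student, one column per time step
seqs <- matrix(c(1, 3, 1,
                 4, 2, 3,
                 2, 3, 1),
               nrow = 3, byrow = TRUE)

counts <- trans.matrix(seqs, prob = FALSE)  # raw transition counts
probs  <- trans.matrix(seqs)                # each row divided by its total
print(counts)
print(probs)
```

Rows of the result are the "from" states and columns the "to" states. A state that is never moved into (here, subject 4) has no column, so the table is not necessarily square.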

Then you can use the markovchain package:

install.packages('markovchain')
library(markovchain)
...
zdeeb
  • Are you trying to implement this paper? http://jmlr.csail.mit.edu/papers/volume6/shani05a/shani05a.pdf – zdeeb Jul 22 '17 at 16:50
0

There are probably fancier solutions, but if I understood correctly what you are looking for, this returns the transition count matrix.

df = read.table(text="Date    StudentID   Subjectid
201601   123        1
201601   234        4
201601   122        2
201602   123        3
201602   123        1
201602   234        2
201603   122        3",header=T)

library(dplyr)
library(tidyr)

# For each student, record the subject taken just before each row (NA for the first row)
df1 = do.call(rbind,lapply(split(df,df$StudentID), function(x) {x$prev_id = c(NA,head(x$Subjectid,-1)); return(x)} ))

# Compute the shared level set once, before either column becomes a factor
lvls = sort(unique(c(df1$prev_id,df1$Subjectid)))
df1$prev_id = factor(df1$prev_id,levels=lvls)
df1$Subjectid = factor(df1$Subjectid,levels=lvls)

df1 = df1[!is.na(df1$prev_id),] %>% group_by(Subjectid,prev_id) %>% 
  tally %>% spread(Subjectid,n,drop=FALSE,fill=0) %>% as.data.frame

Output:

  prev_id 1 2 3 4
1       1 0 0 1 0
2       2 0 0 1 0
3       3 1 0 0 0
4       4 0 1 0 0
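
To go from these counts to the prediction asked for in the question, one option is to divide each row by its total and rank the columns. The following is a minimal base-R sketch that re-enters the count matrix above by hand. `top_next` is a hypothetical helper, and the model is first-order, meaning it looks only at the most recent subject.

```r
# Transition counts from the output above
# (rows: previous subject, columns: next subject)
counts <- matrix(c(0, 0, 1, 0,
                   0, 0, 1, 0,
                   1, 0, 0, 0,
                   0, 1, 0, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(prev = as.character(1:4),
                                 nxt  = as.character(1:4)))

# Row-normalise so each row holds P(next subject | previous subject).
# A row with no observed transitions would produce NaN; none occurs here.
probs <- counts / rowSums(counts)

# Hypothetical helper: up to k most probable next subjects after `current`,
# skipping subjects that were never observed to follow it
top_next <- function(probs, current, k = 3) {
  p <- sort(probs[as.character(current), ], decreasing = TRUE)
  head(p[p > 0], k)
}

print(top_next(probs, current = 3))
```

With only seven rows of data, each subject has a single observed successor, so the helper returns one candidate rather than three; a larger history would be needed for a meaningful top-three ranking.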
Florian