Other questions
There is another question asking how to build a second-order transition matrix; however, the answer there does not seem to produce one.
Second order transition matrix & scoring a sequence
Let's use this dataset:
set.seed(1)
dat <- data.frame(replicate(20, sample(c("A", "B", "C", "D"), size = 100, replace = TRUE)))
What would be the best way to build a second-order transition matrix so that I can easily score a new sequence I encounter, as discussed here? For example, I would like to be able to calculate the probability of observing AAABCAD.
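To make the goal concrete, here is a minimal sketch of what I have in mind, not a definitive implementation. It assumes that each column of dat is one observed sequence and that the first two states are uniformly distributed; score_sequence and p_start are hypothetical names introduced only for illustration.

```r
set.seed(1)
dat <- data.frame(replicate(20, sample(c("A", "B", "C", "D"), size = 100, replace = TRUE)),
                  stringsAsFactors = FALSE)

# Assumption: each COLUMN of `dat` is one observed sequence of 100 states.
# For every column, pair the two preceding states ("From") with the next state ("To").
from <- unlist(lapply(dat, function(col) {
  col <- as.character(col)
  paste0(head(col, -2), head(col[-1], -1))
}))
to <- unlist(lapply(dat, function(col) as.character(col)[-(1:2)]))

# Row-normalised counts: TM["AB", "C"] estimates P(next = C | previous two = A, B).
counts <- unclass(table(From = from, To = to))
TM <- counts / rowSums(counts)

# Hypothetical helper: probability of a string of single-letter states.
score_sequence <- function(s, TM, p_start) {
  x <- strsplit(s, "")[[1]]
  stopifnot(length(x) >= 3)
  idx <- cbind(paste0(head(x, -2), head(x[-1], -1)), x[-(1:2)])
  p_start * prod(TM[idx])
}

# Assumption: the first two states are uniform over the 4 * 4 possible pairs.
p <- score_sequence("AAABCAD", TM, p_start = 1 / (4 * 4))
```

Note that the lookup TM[idx] only works when every pair and every next state of the scored sequence appears in the dimnames of TM.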
Reaction to Julius Vainora
set.seed(1)
mat <- data.frame(replicate(100, sample(c("AAA", "BBB", "CCC", "DDD", "ABC", "ABD"), size = 5, replace = TRUE)))
aux <- apply(mat, 2, function(col) rbind(paste0(head(col, -2), head(col[-1], -1)), col[-1:-2]))
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
head(aux, 3)
TM <- table(aux)
TM <- TM / rowSums(TM)
x <- as.character(unlist(mat[1,]))
transitions <- cbind(paste0(head(x, -2), head(x[-1], -1)), x[-1:-2])
prAA <- 1 / (4 * 4)
prAA * prod(TM[transitions])
When I ran this code, it gave me a probability of 0. However, the sequence whose probability I calculated (the first row of the data frame, here mat) was also used to build the transition matrix. I would not expect this to happen: since the sequence was used to build the matrix, none of its transitions should have zero probability, right?
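To narrow this down, the transitions of the scored sequence can be listed next to their probabilities. The sketch below rebuilds TM exactly as in the code above; check_transitions is a hypothetical helper, not part of any package. One thing that may be relevant: apply(mat, 2, ...) treats each column of mat as a sequence, whereas mat[1, ] is a row, so the sketch checks both a row and a column.

```r
set.seed(1)
states <- c("AAA", "BBB", "CCC", "DDD", "ABC", "ABD")
mat <- data.frame(replicate(100, sample(states, size = 5, replace = TRUE)),
                  stringsAsFactors = FALSE)

# Same construction as above: each COLUMN of `mat` is treated as one sequence.
aux <- apply(mat, 2, function(col) rbind(paste0(head(col, -2), head(col[-1], -1)), col[-1:-2]))
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
TM <- table(aux)
TM <- TM / rowSums(TM)

# Hypothetical helper: one row per transition of `x`, with its probability in TM.
# Prob is NA when the From or To label is not among the dimnames of TM.
check_transitions <- function(x, TM) {
  M <- unclass(TM)  # plain matrix, keeps the dimnames
  from <- paste0(head(x, -2), head(x[-1], -1))
  to <- x[-(1:2)]
  known <- from %in% rownames(M) & to %in% colnames(M)
  prob <- rep(NA_real_, length(from))
  prob[known] <- M[cbind(from[known], to[known])]
  data.frame(From = from, To = to, Prob = prob)
}

row_seq <- as.character(unlist(mat[1, ]))  # a ROW: 100 values, one per column
col_seq <- as.character(mat[[1]])          # a COLUMN: one 5-step sequence used to build TM

chk_row <- check_transitions(row_seq, TM)
chk_col <- check_transitions(col_seq, TM)

# Transitions of the row sequence that are unseen (NA) or have probability 0.
chk_row[is.na(chk_row$Prob) | chk_row$Prob == 0, ]
```

Every transition in chk_col was counted when building TM, so its probabilities are strictly positive; the same is not guaranteed for chk_row.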
Moreover, when I change the mat creation to this line:
mat <- data.frame(replicate(10, sample(c("AAA", "BBB", "CCC", "DDD", "ABC", "ABD"), size = 5, replace = TRUE)))
it gives the error: Error in `[.default`(TM, transitions) : subscript out of bounds
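My understanding is that table() only creates rows and columns for labels it has actually seen, so with few sequences some From pairs are simply missing from TM. A possible workaround, sketched here under that assumption, is to fix the factor levels up front so that every possible pair has a row. The names states, pairs, and row_tot are introduced only for this illustration, and rows for unseen pairs are left at zero rather than NaN.

```r
set.seed(1)
states <- c("AAA", "BBB", "CCC", "DDD", "ABC", "ABD")
mat <- data.frame(replicate(10, sample(states, size = 5, replace = TRUE)),
                  stringsAsFactors = FALSE)

# All 36 possible "From" labels (two consecutive states pasted together).
pairs <- as.vector(outer(states, states, paste0))

# Each COLUMN of `mat` is treated as one sequence, as in the code above.
from <- unlist(lapply(mat, function(col) {
  col <- as.character(col)
  paste0(head(col, -2), head(col[-1], -1))
}))
to <- unlist(lapply(mat, function(col) as.character(col)[-(1:2)]))

# Fixed levels guarantee a 36 x 6 matrix even when some pairs never occur.
counts <- unclass(table(From = factor(from, levels = pairs),
                        To   = factor(to, levels = states)))
row_tot <- rowSums(counts)
TM <- counts / pmax(row_tot, 1)  # rows of unseen pairs stay all zero

# The lookup no longer goes out of bounds, whatever sequence of known states is scored.
x <- as.character(unlist(mat[1, ]))
transitions <- cbind(paste0(head(x, -2), head(x[-1], -1)), x[-1:-2])
p <- prod(TM[transitions])
```

This only removes the error; a transition that was never observed still contributes a probability of 0 to the product.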