R: Build second order transition matrix and score sequences

Question

Other questions
There is another question asking how to build a second order transition matrix, however the answer does not seem to produce a second order transition matrix.

Second order transition matrix & scoring a sequence
Let's use this dataset:

set.seed(1)
dat<-data.frame(replicate(20,sample(c("A", "B", "C","D"), size = 100, replace=TRUE)))

What would be the best way to build a second order transition matrix such that I can easily score a new sequence I encounter as discussed here. For example, such that I can calculate the probability of observing AAABCAD.

Reaction to Julius Vainora

set.seed(1)
mat <-data.frame(replicate(100,sample(c("AAA", "BBB", "CCC","DDD", "ABC", 'ABD'), size = 5, replace=TRUE)))

aux <- apply(mat, 2, function(col) rbind(paste0(head(col, -2), head(col[-1], -1)), col[-1:-2]))
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
head(aux, 3)
TM <- table(aux)
TM <- TM / rowSums(TM)


x <- as.character(unlist(mat[1,]))
transitions <- cbind(paste0(head(x, -2), head(x[-1], -1)), x[-1:-2])

prAA <- 1 / (4 * 4)
prAA * prod(TM[transitions])

When I ran this code it gave me a probability of 0, however the sequence for which I calculated the probability was also used to build the transition matrix (namely the first row of the df, here mat). I suppose this should not happen since the sequence was used to build the transition matrix so none of the transitions can be zero right?

Moreover, when I change the mat creation to this line:

mat <-data.frame(replicate(10,sample(c("AAA", "BBB", "CCC","DDD", "ABC", 'ABD'), size = 5, replace=TRUE)))

It will give the error Error in [.default (TM, transitions) : subscript out of bounds

What exactly is wrong with the other question? What should the second order transition matrix look like (ie how will you know when the answer is correct)? — MrFlick, Apr 10 '19 at 16:53
@MrFlick See the comment by whuber on the first answer and see the second answer for an example of a second order transition matrix: https://stats.stackexchange.com/questions/147164/fit-and-evaluate-a-second-order-transition-matrix-markov-process-in-r — CodeNoob, Apr 10 '19 at 16:55

Julius Vainora · Accepted Answer · 2019-04-10T18:44:58.963

Let's start with the data coming in a matrix format:

set.seed(1)
dat <- replicate(20, sample(c("A", "B", "C", "D"), size = 100, replace = TRUE))

As to estimate the second order transition matrix, we extract the following observed transitions:

aux <- apply(dat, 2, function(col) rbind(paste0(head(col, -2), head(col[-1], -1)), col[-1:-2]))
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
head(aux, 3)
#   From To
# 1   DD  D
# 2   DD  B
# 3   DB  A

The transition matrix then can be estimated with

TM <- table(aux)
(TM <- TM / rowSums(TM)) # As expected, everything around 0.25
#     To
# From         A         B         C         D
#   AA 0.2459016 0.2950820 0.2049180 0.2540984
#   AB 0.2222222 0.3037037 0.1925926 0.2814815
#   AC 0.3162393 0.1794872 0.1709402 0.3333333
#   AD 0.3211679 0.2189781 0.1824818 0.2773723
#   BA 0.2066116 0.2066116 0.2727273 0.3140496
#   BB 0.2517483 0.2587413 0.2167832 0.2727273
#   BC 0.2647059 0.2745098 0.2254902 0.2352941
#   BD 0.3007519 0.2180451 0.2105263 0.2706767
#   CA 0.2500000 0.2931034 0.2068966 0.2500000
#   CB 0.2178218 0.3168317 0.2178218 0.2475248
#   CC 0.2584270 0.2247191 0.2359551 0.2808989
#   CD 0.3083333 0.2583333 0.2500000 0.1833333
#   DA 0.2402597 0.2727273 0.2272727 0.2597403
#   DB 0.2689076 0.2605042 0.2016807 0.2689076
#   DC 0.2416667 0.2750000 0.2166667 0.2666667
#   DD 0.2442748 0.2213740 0.2671756 0.2671756

In your example we have the sequence and transitions given by

x <- c("A", "A", "A", "B", "C", "A", "D")
(transitions <- cbind(paste0(head(x, -2), head(x[-1], -1)), x[-1:-2]))
#      [,1] [,2]
# [1,] "AA" "A" 
# [2,] "AA" "B" 
# [3,] "AB" "C" 
# [4,] "BC" "A" 
# [5,] "CA" "D"

Analogously as in my other answer then,

prAA <- 1 / (4 * 4)
prAA * prod(TM[transitions])
# [1] 6.223154e-05

is the probability to observe x, where prAA is the probability (specified by the user) of observing the first two elements of the sequence, AA.

Generalization: n-th order Markov chain.

n <- 3
aux <- apply(dat, 2, function(col) {
  from <- head(apply(embed(col, n)[, n:1], 1, paste, collapse = ""), -1)
  to <- col[-1:-n]
  rbind(from, to)
})
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
TM <- table(aux)
TM <- TM / rowSums(TM)
head(TM)
#      To
# From          A         B         C         D
#   AAA 0.3541667 0.2083333 0.2083333 0.2291667
#   AAB 0.3103448 0.3103448 0.1724138 0.2068966
#   AAC 0.2142857 0.2857143 0.2857143 0.2142857
#   AAD 0.1463415 0.3902439 0.2439024 0.2195122
#   ABA 0.1200000 0.4800000 0.2000000 0.2000000
#   ABB 0.2424242 0.2727273 0.1515152 0.3333333    

x <- c("A", "A", "A", "B", "C", "A", "D")
(transitions <- cbind(head(apply(embed(x, n)[, n:1], 1, paste, collapse = ""), -1), x[-1:-n]))
#      [,1]  [,2]
# [1,] "AAA" "B" 
# [2,] "AAB" "C" 
# [3,] "ABC" "A" 
# [4,] "BCA" "D" 
prAAA <- 1 / 4^n
prAAA * prod(TM[transitions])
# [1] 3.048129e-05

thankyou so much for taking the time again to answer this !! — CodeNoob, Apr 10 '19 at 17:14
@CodeNoob, no problem, glad it helps! Just fixed a small typo and added an n-th order generalization. — Julius Vainora, Apr 10 '19 at 17:28
I ran the code for one of my datasets however strangely it gave me a probability of zero even when I used the same sequence that was used to build the transition matrix itself, see my edit. — CodeNoob, Apr 10 '19 at 22:09
@CodeNoob, everything seems fine except that I take columns of `dat` as individual sequences, not rows. Using `x <- as.character(unlist(mat[, 1]))` works. — Julius Vainora, Apr 10 '19 at 23:23
aaaah I actually wanted to build the TM based on the rows.. should have been clearer there. I could transpose the df however I think it will slow the code down tremendously when I have a matrix with over 20.000 columns — CodeNoob, Apr 11 '19 at 09:02
@CodeNoob, 20000 columns alone don't imply that and you would need to transpose only once. Also, don't convert it to a data frame as using a matrix should be faster. Let me know how it goes. — Julius Vainora, Apr 11 '19 at 09:06
@CodeNoob, actually all you need to do to adjust the code row rows is to have `apply(dat, 1` rather than `apply(dat, 2`. — Julius Vainora, Apr 11 '19 at 09:12
I guess I should give you some credits when I present my algorithm hahaha, thanks again! — CodeNoob, Apr 11 '19 at 09:25

R: Build second order transition matrix and score sequences

1 Answers1

Linked