2

Other questions
There is another question asking how to build a second order transition matrix, however the answer does not seem to produce a second order transition matrix.

Second order transition matrix & scoring a sequence
Let's use this dataset:

set.seed(1)
dat<-data.frame(replicate(20,sample(c("A", "B", "C","D"), size = 100, replace=TRUE)))

What would be the best way to build a second order transition matrix such that I can easily score a new sequence I encounter as discussed here. For example, such that I can calculate the probability of observing AAABCAD.


Reaction to Julius Vainora

set.seed(1)
mat <-data.frame(replicate(100,sample(c("AAA", "BBB", "CCC","DDD", "ABC", 'ABD'), size = 5, replace=TRUE)))

aux <- apply(mat, 2, function(col) rbind(paste0(head(col, -2), head(col[-1], -1)), col[-1:-2]))
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
head(aux, 3)
TM <- table(aux)
TM <- TM / rowSums(TM)


x <- as.character(unlist(mat[1,]))
transitions <- cbind(paste0(head(x, -2), head(x[-1], -1)), x[-1:-2])

prAA <- 1 / (4 * 4)
prAA * prod(TM[transitions])

When I ran this code it gave me a probability of 0, however the sequence for which I calculated the probability was also used to build the transition matrix (namely the first row of the df, here mat). I suppose this should not happen since the sequence was used to build the transition matrix so none of the transitions can be zero right?

Moreover, when I change the mat creation to this line:

mat <-data.frame(replicate(10,sample(c("AAA", "BBB", "CCC","DDD", "ABC", 'ABD'), size = 5, replace=TRUE)))

It will give the error Error in [.default (TM, transitions) : subscript out of bounds

CodeNoob
  • 1,988
  • 1
  • 11
  • 33
  • What exactly is wrong with the other question? What should the second order transition matrix look like (ie how will you know when the answer is correct)? – MrFlick Apr 10 '19 at 16:53
  • @MrFlick See the comment by whuber on the first answer and see the second answer for an example of a second order transition matrix: https://stats.stackexchange.com/questions/147164/fit-and-evaluate-a-second-order-transition-matrix-markov-process-in-r – CodeNoob Apr 10 '19 at 16:55

1 Answers1

1

Let's start with the data coming in a matrix format:

set.seed(1)
dat <- replicate(20, sample(c("A", "B", "C", "D"), size = 100, replace = TRUE))

As to estimate the second order transition matrix, we extract the following observed transitions:

aux <- apply(dat, 2, function(col) rbind(paste0(head(col, -2), head(col[-1], -1)), col[-1:-2]))
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
head(aux, 3)
#   From To
# 1   DD  D
# 2   DD  B
# 3   DB  A

The transition matrix then can be estimated with

TM <- table(aux)
(TM <- TM / rowSums(TM)) # As expected, everything around 0.25
#     To
# From         A         B         C         D
#   AA 0.2459016 0.2950820 0.2049180 0.2540984
#   AB 0.2222222 0.3037037 0.1925926 0.2814815
#   AC 0.3162393 0.1794872 0.1709402 0.3333333
#   AD 0.3211679 0.2189781 0.1824818 0.2773723
#   BA 0.2066116 0.2066116 0.2727273 0.3140496
#   BB 0.2517483 0.2587413 0.2167832 0.2727273
#   BC 0.2647059 0.2745098 0.2254902 0.2352941
#   BD 0.3007519 0.2180451 0.2105263 0.2706767
#   CA 0.2500000 0.2931034 0.2068966 0.2500000
#   CB 0.2178218 0.3168317 0.2178218 0.2475248
#   CC 0.2584270 0.2247191 0.2359551 0.2808989
#   CD 0.3083333 0.2583333 0.2500000 0.1833333
#   DA 0.2402597 0.2727273 0.2272727 0.2597403
#   DB 0.2689076 0.2605042 0.2016807 0.2689076
#   DC 0.2416667 0.2750000 0.2166667 0.2666667
#   DD 0.2442748 0.2213740 0.2671756 0.2671756

In your example we have the sequence and transitions given by

x <- c("A", "A", "A", "B", "C", "A", "D")
(transitions <- cbind(paste0(head(x, -2), head(x[-1], -1)), x[-1:-2]))
#      [,1] [,2]
# [1,] "AA" "A" 
# [2,] "AA" "B" 
# [3,] "AB" "C" 
# [4,] "BC" "A" 
# [5,] "CA" "D" 

Analogously as in my other answer then,

prAA <- 1 / (4 * 4)
prAA * prod(TM[transitions])
# [1] 6.223154e-05

is the probability to observe x, where prAA is the probability (specified by the user) of observing the first two elements of the sequence, AA.


Generalization: n-th order Markov chain.

n <- 3
aux <- apply(dat, 2, function(col) {
  from <- head(apply(embed(col, n)[, n:1], 1, paste, collapse = ""), -1)
  to <- col[-1:-n]
  rbind(from, to)
})
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
TM <- table(aux)
TM <- TM / rowSums(TM)
head(TM)
#      To
# From          A         B         C         D
#   AAA 0.3541667 0.2083333 0.2083333 0.2291667
#   AAB 0.3103448 0.3103448 0.1724138 0.2068966
#   AAC 0.2142857 0.2857143 0.2857143 0.2142857
#   AAD 0.1463415 0.3902439 0.2439024 0.2195122
#   ABA 0.1200000 0.4800000 0.2000000 0.2000000
#   ABB 0.2424242 0.2727273 0.1515152 0.3333333    

x <- c("A", "A", "A", "B", "C", "A", "D")
(transitions <- cbind(head(apply(embed(x, n)[, n:1], 1, paste, collapse = ""), -1), x[-1:-n]))
#      [,1]  [,2]
# [1,] "AAA" "B" 
# [2,] "AAB" "C" 
# [3,] "ABC" "A" 
# [4,] "BCA" "D" 
prAAA <- 1 / 4^n
prAAA * prod(TM[transitions])
# [1] 3.048129e-05
Julius Vainora
  • 47,421
  • 9
  • 90
  • 102
  • thankyou so much for taking the time again to answer this !! – CodeNoob Apr 10 '19 at 17:14
  • @CodeNoob, no problem, glad it helps! Just fixed a small typo and added an n-th order generalization. – Julius Vainora Apr 10 '19 at 17:28
  • I ran the code for one of my datasets however strangely it gave me a probability of zero even when I used the same sequence that was used to build the transition matrix itself, see my edit. – CodeNoob Apr 10 '19 at 22:09
  • @CodeNoob, everything seems fine except that I take columns of `dat` as individual sequences, not rows. Using `x <- as.character(unlist(mat[, 1]))` works. – Julius Vainora Apr 10 '19 at 23:23
  • aaaah I actually wanted to build the TM based on the rows.. should have been clearer there. I could transpose the df however I think it will slow the code down tremendously when I have a matrix with over 20.000 columns – CodeNoob Apr 11 '19 at 09:02
  • @CodeNoob, 20000 columns alone don't imply that and you would need to transpose only once. Also, don't convert it to a data frame as using a matrix should be faster. Let me know how it goes. – Julius Vainora Apr 11 '19 at 09:06
  • @CodeNoob, actually all you need to do to adjust the code row rows is to have `apply(dat, 1` rather than `apply(dat, 2`. – Julius Vainora Apr 11 '19 at 09:12
  • I guess I should give you some credits when I present my algorithm hahaha, thanks again! – CodeNoob Apr 11 '19 at 09:25