Suppose I have three sequences:

dat <- list( Seq1 =c("A", "B", "C", "D", "C", "A", "C","D","A","A","B","D"),
             Seq2 = c("C" ,"C" ,"B" ,"A" ,"D" ,"D" ,"A" ,"B","C","D","B","A","D"),
             Seq3 = c("D" ,"A" ,"D" ,"A" ,"D", "B", "B", "A","D","A","D","A"))

These sequences are stored in three different CSV files. I want to calculate a first-order Markov chain from these data (aggregated).

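# Transition-count matrix. `actionsoverall` (all possible states) and
# `files` (the CSV paths) are not shown here.
t=matrix(0, nrow = length(actionsoverall), ncol = length(actionsoverall))

for(i in files){
  y=read.csv(i)$x
  # fixed levels give every file the same state-to-index mapping
  yy=as.integer(factor(y, levels = actionsoverall))
  for (j in 1:(length(y)-1)) {
    # count the transition from position j to position j+1
    t[yy[j],yy[j+1]]<-t[yy[j],yy[j+1]]+1
  }
}

# convert the counts to row-wise transition probabilities
for (h in 1:length(actionsoverall)) {
  t[h,]<-t[h,]/sum(t[h,])
}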

What I actually want is to read the sequence from each of the files and pool the counts. For example, if A to B occurs 2 times in file 1, 1 time in file 2 and 3 times in file 3, and A occurs 10 times in total, then the probability will be 6/10.

N.B. If I calculate the transition probabilities for each file separately and then average them, will the result be the same?

Lzz0
  • Could you please clarify your question for people who aren't familiar with Markov chain data? What happens when you try this compared to what you're trying to have happen? If this is a purely statistics question and not an R question, you're posting to the wrong place. – shirewoman2 Apr 06 '18 at 18:06
  • Dear sir, please don't misunderstand me. I thought the accepted mark should only be given once the whole question was solved. That's why I gave you the mark at first but then removed it: I ran into a problem and thought nobody would answer again once it was marked as solved. I am really sorry – Lzz0 Apr 07 '18 at 06:17
  • Sir, I am really sorry. I am facing a great difficulty with this problem. I don't know if anyone can solve it or not – Lzz0 Apr 07 '18 at 15:12

1 Answer

Data construction:

dat <- list( seq1 =c( "A", "B", "C","D","C","A", "C","D","A","A","B","D"),
 seq2 =c( "C","C","B","A","D","D","A","B","C","D","B","A","D"),
 seq3 = c("D","A","D","A","C","C","B","A","D","C","D","A"))

This will give you the first order transition counts:

 lapply( dat, function(s) table( s,         # start
                                 c(s[-1],NA) # next
                                 ) )

#look at matrix( c( s, c(s[-1],NA) ), ncol=2) to verify

$seq1

s   A B C D
  A 1 2 1 0
  B 0 0 1 1
  C 1 0 0 2
  D 1 0 1 0

$seq2

s   A B C D
  A 0 1 0 2
  B 2 0 1 0
  C 0 1 1 1
  D 1 1 0 1

$seq3

s   A B C D
  A 0 0 1 2
  B 1 0 0 0
  C 0 1 1 1
  D 3 0 1 0
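
To see why this gives transition counts, here is a minimal sketch of the pairing for seq1 alone. The names `pairs` and `counts` and the dimension labels `from`/`to` are only illustrative; they are not used elsewhere in this answer.

```r
# Sketch: how table(s, c(s[-1], NA)) pairs each state with its successor.
s <- c("A", "B", "C", "D", "C", "A", "C", "D", "A", "A", "B", "D")  # seq1

# Column 1 is the current state, column 2 the next one.
# The last element has no successor, so its "to" entry is NA.
pairs <- cbind(from = s, to = c(s[-1], NA))
head(pairs, 3)

# table() drops pairs containing NA by default (useNA = "no"),
# so the number of counted transitions is length(s) - 1.
counts <- table(from = s, to = c(s[-1], NA))
sum(counts)
```

Padding with NA keeps both vectors the same length, which `table()` requires, while leaving the final element out of the counts.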

This would accumulate those counts with no averaging:

 Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) )

s   A B C D
  A 1 3 2 4
  B 3 0 2 1
  C 1 2 2 4
  D 5 1 2 1

This might be one way to get a transition matrix from that result:

prop.table( 
     Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) ) 
      , 1)  # specifies row-proportions

s           A         B         C         D
  A 0.1000000 0.3000000 0.2000000 0.4000000
  B 0.5000000 0.0000000 0.3333333 0.1666667
  C 0.1111111 0.2222222 0.2222222 0.4444444
  D 0.5555556 0.1111111 0.2222222 0.1111111
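
On the N.B. in the question: averaging the per-file transition matrices is generally not the same as pooling the counts, because pooling weights each file by how often the state occurs in it, whereas averaging weights every file equally. A small sketch with the data above (the names `counts`, `pooled` and `averaged` are just for illustration):

```r
dat <- list(
  seq1 = c("A","B","C","D","C","A","C","D","A","A","B","D"),
  seq2 = c("C","C","B","A","D","D","A","B","C","D","B","A","D"),
  seq3 = c("D","A","D","A","C","C","B","A","D","C","D","A")
)

# Per-sequence transition counts, as before.
counts <- lapply(dat, function(s) table(s, c(s[-1], NA)))

# Pooled estimate: add the counts first, then normalise each row.
pooled <- prop.table(Reduce("+", counts), 1)

# Averaged estimate: normalise each sequence separately, then take the mean.
averaged <- Reduce("+", lapply(counts, prop.table, margin = 1)) / length(counts)

pooled["A", "B"]    # 3/10
averaged["A", "B"]  # (2/4 + 1/3 + 0/3) / 3
```

The two agree only in special cases, for instance when every sequence leaves each state the same number of times. Note also that the averaged version would produce NaN for any file in which a state has no outgoing transitions, which does not happen with this particular data.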

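This is the new strategy, for the updated data in the question, where one sequence (Seq3) does not use every letter, so the per-sequence tables no longer all have the same dimensions and cannot simply be added: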

newdat <- do.call('rbind', lapply(lapply( dat, function(s) table( s,         
                              c(s[-1],NA) 
                              ) ) , as.data.frame))
str(newdat)
'data.frame':   41 obs. of  3 variables:
 $ s   : Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3 4 1 2 ...
 $ Var2: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 2 2 2 2 3 3 ...
 $ Freq: int  1 0 1 1 2 0 0 0 1 1 ...

With the newdat-object one can simply do a tabulation on the s and Var2 features using xtabs to get sums:

>  xtabs( Freq ~ s + Var2, newdat)
   Var2
s   A B C D
  A 1 3 1 6
  B 3 1 2 1
  C 1 1 1 3
  D 6 2 1 1

And then redo the prop.table-operation to get the row proportions.

prop.table(xtabs( Freq ~ s + Var2, newdat), 1)
#---------
   Var2
s            A          B          C          D
  A 0.09090909 0.27272727 0.09090909 0.54545455
  B 0.42857143 0.14285714 0.28571429 0.14285714
  C 0.16666667 0.16666667 0.16666667 0.50000000
  D 0.60000000 0.20000000 0.10000000 0.10000000
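
If the full set of states is known in advance (as with the 7 words mentioned in the comments), a possible alternative is to fix the factor levels before tabulating, so that every per-file table has the same shape and `Reduce("+", ...)` works again. This is only a sketch: `states` and `count_transitions` are hypothetical names, and the data are the updated sequences from the question.

```r
# Assumed full state set; in the real data this would be the 7 known words.
states <- c("A", "B", "C", "D")

dat <- list(
  Seq1 = c("A","B","C","D","C","A","C","D","A","A","B","D"),
  Seq2 = c("C","C","B","A","D","D","A","B","C","D","B","A","D"),
  Seq3 = c("D","A","D","A","D","B","B","A","D","A","D","A")  # never uses "C"
)

# Hypothetical helper: transition counts over a fixed set of states.
count_transitions <- function(s, states) {
  f <- factor(s, levels = states)          # unused states keep a zero row/column
  table(from = f[-length(f)], to = f[-1])  # pair each state with its successor
}

# Every table is length(states) x length(states), so they can be added directly.
total <- Reduce("+", lapply(dat, count_transitions, states = states))
prop.table(total, 1)
```

One caveat: a state that never occurs as a starting point in any file has a row sum of zero, and `prop.table` will then return NaN for that row.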
IRTFM
  • Thank you very much. But what happens if one sequence does not contain all the letters (the sequence uses 3 instead of 4)? – Lzz0 Apr 07 '18 at 01:56
  • Gee, it's odd that you ask this since your example did not seem to exhibit such a difficulty. Is your "real" data situation so poorly populated? – IRTFM Apr 07 '18 at 03:30
  • Yes. Actually I have 38 files in total, each containing about 350 words on average (all drawn from 7 specific words). But a few files do not use all 7 words. That's why I am having trouble with your code. – Lzz0 Apr 07 '18 at 03:39
  • I have updated the data. Suppose Seq3 doesn't use "C" at all. My actual problem is like that: some of the files don't contain every word. – Lzz0 Apr 07 '18 at 03:55
  • This is a problem which has 7 variables. 38 students solved it. Some used 300 variables sequentially, some 400, and so on; there is no specific number of, or restriction on, variables. So some used 5 variables while some used 6 variables sequentially to solve the problem – Lzz0 Apr 07 '18 at 04:51
  • This comment seems tangential to the problem. Do we know in advance that the problem will have a 7x7 transition matrix as its answer? – IRTFM Apr 07 '18 at 04:55
  • Yes. As the number of variables is 7, the transition probability matrix will be 7x7. – Lzz0 Apr 07 '18 at 05:15