
I need help with a topic related to Markov chains and preprocessing of data. Suppose I have the following matrix relating individuals to states over time:

     ID Time1 Time2
1 14021     A     A
2 15031     B     A
3 16452     A     C

I would like to obtain, from this matrix, the state transition count matrix. That is, what is required is

  A  B  C
A 1  0  1
B 1  0  0
C 0  0  0

and the same thing, but now weighted by the total number of transitions leaving each state, i.e.,

  A    B   C
A 0.5  0  0.5
B 1    0   0
C 0    0   0

(as there are two transitions leaving state A). I know that the markovchain package has functionality for doing this if one has a sequence, say AAABBAAABBCC, but not if the data is set up like mine. Ideally a direct procedure would be great, but if there is some way of turning the data into a set of sequences, that would work as well.

Arrebimbomalho

3 Answers


Here is another base R solution.

# example data (the question's df, minus the ID column)
df <- data.frame(Time1 = c("A", "B", "A"), Time2 = c("A", "A", "C"),
                 stringsAsFactors = FALSE)

myStates <- sort(unique(c(df$Time1, df$Time2)))  # all observed states
lenSt <- length(myStates)

# map each state label to its index in myStates
currState <- match(df$Time1, myStates)
nextState <- match(df$Time2, myStates)
transMat <- matrix(0L, lenSt, lenSt)

# mark each observed transition via matrix indexing; note this records
# whether a transition occurs at least once, not how many times
transMat[cbind(currState, nextState)] <- 1L

# row-normalize; rows with no outgoing transitions produce NaN, so zero them
transMat <- transMat / rowSums(transMat)
transMat[is.na(transMat)] <- 0

transMat
     [,1] [,2] [,3]
[1,]  0.5    0  0.5
[2,]  1.0    0  0.0
[3,]  0.0    0  0.0
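
If the real data has many Time columns (as discussed in the comments below), one option is to first stack every pair of consecutive columns into a single two-column transitions table and feed that to the code above. A rough sketch, assuming the wide data frame is called bigDf and its state columns all match the pattern Time1, Time2, ... (both names are hypothetical):

# stack consecutive (TimeK, TimeK+1) column pairs into one long table,
# one row per observed transition
timeCols <- grep("^Time", names(bigDf), value = TRUE)
transitions <- do.call(rbind, lapply(seq_len(length(timeCols) - 1), function(k) {
  data.frame(Time1 = bigDf[[timeCols[k]]],
             Time2 = bigDf[[timeCols[k + 1]]],
             stringsAsFactors = FALSE)
}))
# transitions can now replace df in the code above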
Joseph Wood
  • Thanks! The dataset I need to convert has literally millions of observations, hundreds of states, and dozens of periods... probably not practical :( Do you know of any automatic way of turning the data frame into the df you have at the start of your code? That would be great. – Arrebimbomalho Dec 15 '17 at 19:54
  • @Arrebimbomalho, you could subset your data.frame to only include the columns labeled `Time`. I simply included the creation of the df because you didn't provide it in your question. In the future, you could `dput` your data structures to help speed up the analysis for people trying to help. Other than subsetting, I'm not really sure I could help much further without seeing some more data. – Joseph Wood Dec 15 '17 at 20:08

An igraph approach, using df from Joseph's answer:

library(igraph)

g <- graph_from_data_frame(df)  # one edge per observed transition

# weight each edge by 1 / out-degree of its source vertex
E(g)$weight <- 1 / degree(g, mode = "out")[df$Time1]

as_adj(g, attr = "weight", sparse = FALSE)  # weighted adjacency = transition matrix

    A B   C
A 0.5 0 0.5
B 1.0 0 0.0
C 0.0 0 0.0
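
If the raw count matrix from the question is wanted as well, dropping the weight attribute should do it, since graph_from_data_frame creates one edge per row and as_adj with no attr counts the parallel edges between each pair of vertices. A small sketch using the same g:

as_adj(g, sparse = FALSE)  # unweighted: number of observed transitions per pair

This should reproduce the count matrix shown in the question.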
user20650
  • @Arrebimbomalho This is by far the better answer. If I were you, I would accept this. – M-- Dec 15 '17 at 20:04
  • @Arrebimbomalho, completely agree with Masoud! – Joseph Wood Dec 15 '17 at 20:12
  • @JosephWood I hope you understand that this is nothing personal. I just thought this answer is more elegant (compared to my own, of course). Cheers. – M-- Dec 15 '17 at 20:15
  • @Masoud, not at all... I really meant that I agree this is a far better answer. I truly appreciate the attitude of putting the community first. – Joseph Wood Dec 15 '17 at 20:19

There's definitely a better way; this is me doodling with loops on a lame Friday afternoon.

# all states observed in either time column
lvls <- sort(unique(unlist(df[c("Time1", "Time2")])))

dat <- matrix(0, nrow = length(lvls), ncol = length(lvls))

colnames(dat) <- lvls
rownames(dat) <- lvls

# one string per individual: "from" state pasted to "to" state
concat <- paste0(df$Time1, df$Time2)

# fill each cell with its "fromto" label (this coerces dat to character)
for (i in 1:length(lvls)) {
  for (j in 1:length(lvls)) {
    dat[i,j] <- paste0(rownames(dat)[i], colnames(dat)[j])
  }
}

# count how many observed transitions carry each label
# (fixed = TRUE matches literally rather than as a regex)
dat <- matrix(sapply(dat, function(x) length(grep(x, concat, fixed = TRUE))), 
       nrow = length(lvls), ncol = length(lvls))

colnames(dat) <- lvls
rownames(dat) <- lvls

dat

##   A B C
## A 1 0 1
## B 1 0 0
## C 0 0 0

dat <- dat / rowSums(dat)  # row-normalize
dat[is.na(dat)] <- 0       # zero out rows with no outgoing transitions

dat

##    A B   C
##A 0.5 0 0.5
##B 1.0 0 0.0
##C 0.0 0 0.0
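
As an aside, the double loop above could be replaced with a single outer() call, which builds the same label matrix in one vectorized step. A sketch (using exact string comparison instead of grep; lvls and concat as defined above):

pairLabels <- outer(lvls, lvls, paste0)  # "fromto" label for every cell
dat <- matrix(vapply(pairLabels, function(x) sum(concat == x), integer(1)),
              nrow = length(lvls), dimnames = list(lvls, lvls))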
M--
  • Great! Also, it's scalable, so it may be able to handle the job (the real matrix I need has millions of observations...) – Arrebimbomalho Dec 15 '17 at 19:55
  • @Arrebimbomalho Those two loops in the middle would not be that fast, but with, say, 1M rows × 100 columns, it shouldn't take more than minutes to complete, I guess. – M-- Dec 15 '17 at 19:57
  • It seems the first bit could be done with `table`, i.e. `df[c("Time1", "Time2")] <- lapply(df[c("Time1", "Time2")], factor, levels = c("A", "B", "C")); table(df)` – user20650 Dec 15 '17 at 19:59
  • @user20650 `attempt to make a table with >= 2^31 elements` That would happen for large datasets, though, right? – M-- Dec 15 '17 at 20:01
  • Well, only use `table` on the two columns; the output size will depend on the number of unique levels. – user20650 Dec 15 '17 at 20:02
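
For completeness, the table() approach from the last few comments might look like this (a sketch, assuming df carries the Time1/Time2 columns as in the answers above):

# cross-tabulate current vs. next state; shared factor levels keep the matrix square
lv <- sort(unique(unlist(df[c("Time1", "Time2")])))
counts <- table(factor(df$Time1, levels = lv), factor(df$Time2, levels = lv))

probs <- counts / rowSums(counts)  # row-normalize
probs[is.na(probs)] <- 0           # zero out rows with no outgoing transitions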