Group observations chronologically and by group R / data.table

Question

I have the following problem: I have a data set like the one below:

library(data.table)

dt <- 
  data.table(
  student = c(rep(1, 8), rep(2, 8)), 
  year = rep(2001:2008, 2),
  track = c(rep("Highschool", 3), rep("Vocational", 2), rep("Uni", 1), rep("Vocational", 2), 
            rep("Vocational", 2), rep("Highschool", 4), rep("Vocational", 2))
)

#  student year      track
# 1:       1 2001 Highschool
# 2:       1 2002 Highschool
# 3:       1 2003 Highschool
# 4:       1 2004 Vocational
# 5:       1 2005 Vocational
# 6:       1 2006        Uni
# 7:       1 2007 Vocational
# 8:       1 2008 Vocational
# 9:       2 2001 Vocational
#10:       2 2002 Vocational
#11:       2 2003 Highschool
#12:       2 2004 Highschool
#13:       2 2005 Highschool
#14:       2 2006 Highschool
#15:       2 2007 Vocational
#16:       2 2008 Vocational

As you can see, the data chronologically tracks in which type of education a student is in a given year. I would like to assign unique identifiers to the track the number of program type that a student is enrolled in while keeping the chronological ordering. Hence, I would like that my data.table looks like this:

dt[, tracker := c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), 
                  rep(1, 2), rep(2, 4), rep(3, 2))]
#    student year      track tracker
# 1:       1 2001 Highschool       1
# 2:       1 2002 Highschool       1
# 3:       1 2003 Highschool       1
# 4:       1 2004 Vocational       2
# 5:       1 2005 Vocational       2
# 6:       1 2006        Uni       3
# 7:       1 2007 Vocational       4
# 8:       1 2008 Vocational       4
# 9:       2 2001 Vocational       1
#10:       2 2002 Vocational       1
#11:       2 2003 Highschool       2
#12:       2 2004 Highschool       2
#13:       2 2005 Highschool       2
#14:       2 2006 Highschool       2
#15:       2 2007 Vocational       3
#16:       2 2008 Vocational       3

I have now come up with the following solution:

dt[, helper := ifelse(shift(track) == track, 0, 1)]
dt[1, helper := 0]
dt[, tracker := cumsum(helper) + 1, by = "student"]

dt
# student year      track helper tracker
# 1:       1 2001 Highschool      0       1
# 2:       1 2002 Highschool      0       1
# 3:       1 2003 Highschool      0       1
# 4:       1 2004 Vocational      1       2
# 5:       1 2005 Vocational      0       2
# 6:       1 2006        Uni      1       3
# 7:       1 2007 Vocational      1       4
# 8:       1 2008 Vocational      0       4
# 9:       2 2001 Vocational      0       1
#10:       2 2002 Vocational      0       1
#11:       2 2003 Highschool      1       2
#12:       2 2004 Highschool      0       2
#13:       2 2005 Highschool      0       2
#14:       2 2006 Highschool      0       2
#15:       2 2007 Vocational      1       3
#16:       2 2008 Vocational      0       3

Now I am wondering: is there a more "direct" way to achieve my goal either using data.table/dplyr/base syntax?

s_baldur · Accepted Answer · 2020-07-09T15:11:51.027

data.table::rleid(): Consecutive runs of identical values belong to the same group

dt[, tracker := rleid(track), by = student]

    student year      track tracker
 1:       1 2001 Highschool       1
 2:       1 2002 Highschool       1
 3:       1 2003 Highschool       1
 4:       1 2004 Vocational       2
 5:       1 2005 Vocational       2
 6:       1 2006        Uni       3
 7:       1 2007 Vocational       4
 8:       1 2008 Vocational       4
 9:       2 2001 Vocational       1
10:       2 2002 Vocational       1
11:       2 2003 Highschool       2
12:       2 2004 Highschool       2
13:       2 2005 Highschool       2
14:       2 2006 Highschool       2
15:       2 2007 Vocational       3
16:       2 2008 Vocational       3

Without rleid() just for fun:

dt[, tracker := cumsum(shift(track, fill = track[1]) != track) + 1L, by = student]

Daniel O · Answer 2 · 2020-07-09T15:08:27.487

in Base

dt$tracker <- unsplit(tapply(dt$track,dt$student, function(x) c(1,1+cumsum(diff( as.numeric(factor(x)))!= 0 ))),dt$student)

output:

      student year      track tracker
 1:       1 2001 Highschool       1
 2:       1 2002 Highschool       1
 3:       1 2003 Highschool       1
 4:       1 2004 Vocational       2
 5:       1 2005 Vocational       2
 6:       1 2006        Uni       3
 7:       1 2007 Vocational       4
 8:       1 2008 Vocational       4
 9:       2 2001 Vocational       1
10:       2 2002 Vocational       1
11:       2 2003 Highschool       2
12:       2 2004 Highschool       2
13:       2 2005 Highschool       2
14:       2 2006 Highschool       2
15:       2 2007 Vocational       3
16:       2 2008 Vocational       3

Group observations chronologically and by group R / data.table

2 Answers2