I have the following problem: I have a data set like the one below:
library(data.table)
dt <-
data.table(
student = c(rep(1, 8), rep(2, 8)),
year = rep(2001:2008, 2),
track = c(rep("Highschool", 3), rep("Vocational", 2), rep("Uni", 1), rep("Vocational", 2),
rep("Vocational", 2), rep("Highschool", 4), rep("Vocational", 2))
)
# student year track
# 1: 1 2001 Highschool
# 2: 1 2002 Highschool
# 3: 1 2003 Highschool
# 4: 1 2004 Vocational
# 5: 1 2005 Vocational
# 6: 1 2006 Uni
# 7: 1 2007 Vocational
# 8: 1 2008 Vocational
# 9: 2 2001 Vocational
#10: 2 2002 Vocational
#11: 2 2003 Highschool
#12: 2 2004 Highschool
#13: 2 2005 Highschool
#14: 2 2006 Highschool
#15: 2 2007 Vocational
#16: 2 2008 Vocational
As you can see, the data chronologically tracks in which type of education a student is in a given year. I would like to assign unique identifiers to the track the number of program type that a student is enrolled in while keeping the chronological ordering. Hence, I would like that my data.table looks like this:
dt[, tracker := c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2),
rep(1, 2), rep(2, 4), rep(3, 2))]
# student year track tracker
# 1: 1 2001 Highschool 1
# 2: 1 2002 Highschool 1
# 3: 1 2003 Highschool 1
# 4: 1 2004 Vocational 2
# 5: 1 2005 Vocational 2
# 6: 1 2006 Uni 3
# 7: 1 2007 Vocational 4
# 8: 1 2008 Vocational 4
# 9: 2 2001 Vocational 1
#10: 2 2002 Vocational 1
#11: 2 2003 Highschool 2
#12: 2 2004 Highschool 2
#13: 2 2005 Highschool 2
#14: 2 2006 Highschool 2
#15: 2 2007 Vocational 3
#16: 2 2008 Vocational 3
I have now come up with the following solution:
dt[, helper := ifelse(shift(track) == track, 0, 1)]
dt[1, helper := 0]
dt[, tracker := cumsum(helper) + 1, by = "student"]
dt
# student year track helper tracker
# 1: 1 2001 Highschool 0 1
# 2: 1 2002 Highschool 0 1
# 3: 1 2003 Highschool 0 1
# 4: 1 2004 Vocational 1 2
# 5: 1 2005 Vocational 0 2
# 6: 1 2006 Uni 1 3
# 7: 1 2007 Vocational 1 4
# 8: 1 2008 Vocational 0 4
# 9: 2 2001 Vocational 0 1
#10: 2 2002 Vocational 0 1
#11: 2 2003 Highschool 1 2
#12: 2 2004 Highschool 0 2
#13: 2 2005 Highschool 0 2
#14: 2 2006 Highschool 0 2
#15: 2 2007 Vocational 1 3
#16: 2 2008 Vocational 0 3
Now I am wondering: is there a more "direct" way to achieve my goal either using data.table/dplyr/base syntax?