I'm attempting to create a daily time series dataset from what is currently observed only periodically. I can successfully perform the desired operation for a single case but can't work out how to scale to the entire dataset. For example:
UNIT <- c(100,100, 200, 200, 200, 200, 200, 300, 300, 300,300)
STATUS <- c('ACTIVE','INACTIVE','ACTIVE','ACTIVE','INACTIVE','ACTIVE','INACTIVE','ACTIVE','ACTIVE',
'ACTIVE','INACTIVE')
TERMINATED <- as.Date(c('1999-07-06' , '2008-12-05' , '2000-08-18' , '2000-08-18' ,'2000-08-18' ,'2008-08-18',
'2008-08-18','2006-09-19','2006-09-19' ,'2006-09-19' ,'1999-03-15'))
START <- as.Date(c('2007-04-23','2008-12-06','2004-06-01','2007-02-01','2008-04-19','2010-11-29','2010-12-30',
'2007-10-29','2008-02-05','2008-06-30','2009-02-07'))
STOP <- as.Date(c('2008-12-05','2012-12-31','2007-01-31','2008-04-18','2010-11-28','2010-12-29','2012-12-31',
'2008-02-04','2008-06-29','2009-02-06','2012-12-31'))
TEST <- data.frame(UNIT,STATUS,TERMINATED,START,STOP)
TEST
Which is observations on units over intervals:
UNIT STATUS TERMINATED START STOP
1 100 ACTIVE 1999-07-06 2007-04-23 2008-12-05
2 100 INACTIVE 2008-12-05 2008-12-06 2012-12-31
3 200 ACTIVE 2000-08-18 2004-06-01 2007-01-31
4 200 ACTIVE 2000-08-18 2007-02-01 2008-04-18
5 200 INACTIVE 2000-08-18 2008-04-19 2010-11-28
6 200 ACTIVE 2008-08-18 2010-11-29 2010-12-29
7 200 INACTIVE 2008-08-18 2010-12-30 2012-12-31
8 300 ACTIVE 2006-09-19 2007-10-29 2008-02-04
9 300 ACTIVE 2006-09-19 2008-02-05 2008-06-29
10 300 ACTIVE 2006-09-19 2008-06-30 2009-02-06
11 300 INACTIVE 1999-03-15 2009-02-07 2012-12-31
I'd like to take each unit and duplicate the values on "STATUS" and "TERMINATE" (along with N other covariates in the large dataset) daily, over the entire range of the START and END dates. Doing it for a single record....
A <- seq(TEST$START[1], TEST$STOP[1], "days") #vector of relevant date sequences
#keeping the old data, now with daily date "fill"
B <- matrix(NA, length(A), dim(TEST[-c(4,5)])[2])
C <- data.frame(A,B)
#carry forward observations on covariates through date range
TEST[-c(4,5)][1,] #note terminated has the proper date status:
UNIT STATUS TERMINATED
1 100 ACTIVE 1999-07-06
#now the TERMINATED loses its 'date' status for some reason
C[-c(1)][1,] <- TEST[-c(4,5)][1,]
D <- na.locf(C)
colnames(D)[2:4] <-colnames(TEST)[1:3]
colnames(D)[1] <- "DATE"
head(D)
DATE UNIT STATUS TERMINATED
1 2007-04-23 100 1 10778
2 2007-04-24 100 1 10778
3 2007-04-25 100 1 10778
4 2007-04-26 100 1 10778
5 2007-04-27 100 1 10778
6 2007-04-28 100 1 10778
The observations for the first row are duplicated over the range of START to END and a new vector is created: a daily time series for the entire period. I would like to do this for row 2, bind it to D and so on by UNIT of analysis. I have written a for loop with na.locf in an unsuccessful attempt to generalize:
for(i in 1:nrow(TEST)){
for(j in 0:nrow(TEST)-1) {
A <- seq(TEST$START[i], TEST$STOP[i], "days")
B <- matrix(NA, length(A), dim(TEST[-c(4,5)])[2])
C <- data.frame(A,B)
C[-c(1)][1,] <- TEST[-c(4,5)][i,]
assign(paste("D",i, sep=""),na.locf(C))
#below here the code does not work. R does not recognize i and j as I intend
#I haven't been able to overcome this using assign, evaluate etc.
colnames(Di)[2:4] <-colnames(TEST)[1:3]
colnames(Di)[1] <- "DATE"
D0 <- matrix(NA, 1, dim(Di)[2])
assign(paste("D", j, sep = ""),Dj)
rbind(Di,Dj)
}
}
The obvious problem with the single record "solution" is dealing with the "TERMINATED" Date. Just prior to using na.locf it loses it's Date status.
I'm hoping there is a much better way of looking at this and I have just buried myself in complication out of ignorance.