
I'm attempting to create a daily time series dataset from what is currently observed only periodically. I can successfully perform the desired operation for a single case but can't work out how to scale to the entire dataset. For example:

        UNIT <- c(100,100, 200, 200, 200, 200, 200, 300, 300, 300,300)
        STATUS <- c('ACTIVE','INACTIVE','ACTIVE','ACTIVE','INACTIVE','ACTIVE','INACTIVE','ACTIVE','ACTIVE',
                    'ACTIVE','INACTIVE') 
        TERMINATED <- as.Date(c('1999-07-06' , '2008-12-05' , '2000-08-18' , '2000-08-18' ,'2000-08-18' ,'2008-08-18',
                        '2008-08-18','2006-09-19','2006-09-19' ,'2006-09-19' ,'1999-03-15')) 
        START <- as.Date(c('2007-04-23','2008-12-06','2004-06-01','2007-02-01','2008-04-19','2010-11-29','2010-12-30',
                   '2007-10-29','2008-02-05','2008-06-30','2009-02-07'))
        STOP <- as.Date(c('2008-12-05','2012-12-31','2007-01-31','2008-04-18','2010-11-28','2010-12-29','2012-12-31',
                  '2008-02-04','2008-06-29','2009-02-06','2012-12-31'))
        TEST <- data.frame(UNIT,STATUS,TERMINATED,START,STOP)
        TEST                   

which gives observations on units over intervals:

   UNIT   STATUS TERMINATED      START       STOP
1   100   ACTIVE 1999-07-06 2007-04-23 2008-12-05
2   100 INACTIVE 2008-12-05 2008-12-06 2012-12-31
3   200   ACTIVE 2000-08-18 2004-06-01 2007-01-31
4   200   ACTIVE 2000-08-18 2007-02-01 2008-04-18
5   200 INACTIVE 2000-08-18 2008-04-19 2010-11-28
6   200   ACTIVE 2008-08-18 2010-11-29 2010-12-29
7   200 INACTIVE 2008-08-18 2010-12-30 2012-12-31
8   300   ACTIVE 2006-09-19 2007-10-29 2008-02-04
9   300   ACTIVE 2006-09-19 2008-02-05 2008-06-29
10  300   ACTIVE 2006-09-19 2008-06-30 2009-02-06
11  300 INACTIVE 1999-03-15 2009-02-07 2012-12-31            

I'd like to take each unit and duplicate the values of "STATUS" and "TERMINATED" (along with N other covariates in the large dataset) daily, over the entire range of the START and STOP dates. Doing it for a single record:

        library(zoo)  # for na.locf
        A <- seq(TEST$START[1], TEST$STOP[1], "days") #vector of relevant date sequences 

        #keeping the old data, now with daily date "fill" 
        B <- matrix(NA, length(A), dim(TEST[-c(4,5)])[2]) 
        C <- data.frame(A,B)

        #carry forward observations on covariates through date range 
        TEST[-c(4,5)][1,]  #note terminated has the proper date status:
        UNIT STATUS TERMINATED
         1  100 ACTIVE 1999-07-06

        #now the TERMINATED loses its 'date' status for some reason
        C[-c(1)][1,] <- TEST[-c(4,5)][1,] 
        D <-  na.locf(C)
        colnames(D)[2:4] <-colnames(TEST)[1:3]
        colnames(D)[1] <- "DATE"
        head(D)

        DATE UNIT STATUS TERMINATED
1 2007-04-23  100      1      10778
2 2007-04-24  100      1      10778
3 2007-04-25  100      1      10778
4 2007-04-26  100      1      10778
5 2007-04-27  100      1      10778
6 2007-04-28  100      1      10778
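(The 10778 shown for TERMINATED is the date's underlying day count since 1970-01-01; the Date class is dropped when the value passes through the numeric matrix B. The original date is still recoverable:)

```r
as.Date(10778, origin = "1970-01-01")
# [1] "1999-07-06"
```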

The observations for the first row are duplicated over the range of START to STOP, creating a daily time series for the entire period. I would like to do this for row 2, bind it to D, and so on by UNIT of analysis. I have written a for loop with na.locf in an unsuccessful attempt to generalize:

for(i in 1:nrow(TEST)){
  for(j in 0:nrow(TEST)-1) {
  A <-  seq(TEST$START[i], TEST$STOP[i], "days")

  B <- matrix(NA, length(A), dim(TEST[-c(4,5)])[2])
  C <- data.frame(A,B)

  C[-c(1)][1,] <- TEST[-c(4,5)][i,] 
  assign(paste("D",i, sep=""),na.locf(C)) 

  #below here the code does not work. R does not recognize i and j as I intend
  #I haven't been able to overcome this using assign, evaluate etc. 
  colnames(Di)[2:4] <-colnames(TEST)[1:3]
  colnames(Di)[1] <- "DATE"

  D0 <- matrix(NA, 1, dim(Di)[2])
  assign(paste("D", j, sep = ""),Dj)
  rbind(Di,Dj)

   }
  }            

The obvious problem with the single-record "solution" is dealing with the "TERMINATED" date: just before na.locf is applied, it loses its Date class.

I'm hoping there is a much better way of looking at this and I have just buried myself in complication out of ignorance.
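[Editor's note: one base-R way to sidestep the Date coercion entirely is to replicate each row by indexing, rather than assigning into an NA matrix, so every column keeps its class. A sketch (the helper name `expand_row` is illustrative, not from the original post), using the `TEST` data frame defined above:]

```r
# Expand each row of TEST to one record per day from START to STOP,
# carrying STATUS, TERMINATED and any other covariates forward unchanged.
expand_row <- function(i) {
  dates <- seq(TEST$START[i], TEST$STOP[i], by = "day")
  # Replicating the row by indexing keeps each column's class intact,
  # so TERMINATED stays a Date; columns 4 and 5 (START, STOP) are dropped.
  cbind(DATE = dates, TEST[rep(i, length(dates)), -c(4, 5)], row.names = NULL)
}
DAILY <- do.call(rbind, lapply(seq_len(nrow(TEST)), expand_row))
str(DAILY$TERMINATED)  # still class "Date", no na.locf needed
```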

hubert_farnsworth

1 Answer


It is relatively easy to do in SQL, so you can use sqldf, which treats data.frames as SQL tables.

dates <- data.frame( date = seq.Date( min(TEST$START), max(TEST$STOP), by = 1 ) )
library(sqldf)
result <- sqldf( "
  SELECT *
  FROM TEST, dates
  WHERE START <= date AND date <= STOP
" )
head( result )

If the data is large, it may be worthwhile to store the data in a database, and do the computations there.

# With SQLite, a database is just a file
library(RSQLite)
connection <- dbConnect( SQLite(), "/tmp/test.db" )  

# Copy the data.frames to the "Test" and "Dates" table.
# When transferring data across systems, it is often easier 
# to convert dates to strings.
convert_dates <- function(d) {
  as.data.frame( lapply( 
    d, 
    function(u) if( "Date" %in% class(u) ) as.character(u) else u 
  ) ) 
}
dbWriteTable(connection, "Test",  convert_dates(TEST),  row.names = FALSE )
dbWriteTable(connection, "Dates", convert_dates(dates), row.names = FALSE )

# Check how many rows the query has: it could be 
# that the result does not fit in memory
dbGetQuery( connection, "
  SELECT COUNT(*) 
  FROM   Test, Dates 
  WHERE  start <= date AND date <= stop
" )

# If it is reasonable, retrieve all the data
dbGetQuery( connection, "
  SELECT * 
  FROM   Test, Dates 
  WHERE  start <= date AND date <= stop
" )

# If not, only retrieve what you need
dbGetQuery( connection, "
  SELECT * 
  FROM   Test, Dates 
  WHERE  start <= date AND date <= stop
  AND    '2013-04-01' <= date AND date <= '2013-04-30'
" )
Vincent Zoonekynd
  • great, thanks! This is a really useful package for data management. – hubert_farnsworth Mar 31 '13 at 01:55
  • any tips on using this package with a big data frame? I'm currently getting the 'can't allocate vector of size N' problem – hubert_farnsworth Mar 31 '13 at 22:48
  • If the data is too large, you can do everything in the database: I have updated my answer accordingly. (But the volume of data may be explained by some dates very far in the future, e.g., `4712-12-31` in your example.) – Vincent Zoonekynd Apr 01 '13 at 09:28
  • thanks! That was a typo in the STOP vector, which I've now changed to 2012-12-31. I think the result will be at least 1 million rows of data. I suspect the result just doesn't fit in memory, so I might need to go to the cloud. The solution you suggest runs on my toy example using my Mac, but the big database is stored on a Windows server and I get the following error after 'connection': Error in sqliteNewConnection(drv, ...) : RS-DBI driver: (could not connect to dbname: unable to open database file. Any idea where this comes from? I can't find anything in the package manual. – hubert_farnsworth Apr 01 '13 at 23:41
  • It means that the file `/tmp/test.db` cannot be created, because there is no `/tmp` directory. Just change the file name. – Vincent Zoonekynd Apr 02 '13 at 06:38
  • ah thanks! It looks like the data is still too large. The row count works, about 47 million rows, but it won't retrieve all the data. My understanding of SQL doesn't go beyond Wikipedia, but shouldn't I be able to use as much memory as the hard drive holds with this method? – hubert_farnsworth Apr 03 '13 at 06:14
  • SQLite does not have any problem with the size of the data (indeed, it can count how many rows there are); the problem is that you try to retrieve all the data, from the database (on disk) to a data.frame (in memory). Depending on what you want to do with the data, you may be able to do it directly in SQL, or split the data into smaller datasets to be processed separately. – Vincent Zoonekynd Apr 03 '13 at 08:02
  • I did it in SQL by splitting the data.frame into subsets and the operations worked fine in SQL. The problem is putting the pieces back together in R after all the work is done... I can create N data.frames of the subsets but can't make any use of them after that. Is it possible to do the equivalent of rbind() on individual files within SQL and output the result as a .csv or .Rdata instead? – hubert_farnsworth Apr 03 '13 at 10:01
  • An RData file would not be very useful if the data is too large to be loaded in R. To create a CSV file (I also wonder what you will be able to do with it, given its size) from the SQLite file, I would use the command line (no need to cut the data into pieces): `echo -e ".mode csv\n.header on\n.output /tmp/test.csv\nSELECT * FROM Test;" | sqlite3 /tmp/test.db` (you will need to change the file names). – Vincent Zoonekynd Apr 03 '13 at 13:23
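
[Editor's note: the "split the data into smaller datasets" route in the comments can also be done without manual subsetting, by streaming the query result in chunks with DBI's `dbSendQuery`/`dbFetch`. A sketch, assuming the `/tmp/test.db` file and the `Test`/`Dates` tables created in the answer:]

```r
library(RSQLite)
connection <- dbConnect(SQLite(), "/tmp/test.db")

# Send the query without fetching anything yet
res <- dbSendQuery(connection, "
  SELECT *
  FROM   Test, Dates
  WHERE  start <= date AND date <= stop
")

# Pull the 47 million rows in manageable blocks; each chunk is
# processed (e.g. appended to a CSV) and then discarded from memory
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 100000)  # 100,000 rows at a time
  # ...process or write out `chunk` here...
}

dbClearResult(res)
dbDisconnect(connection)
```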