
I am not sure whether I can ask this question here; let me know if I should post it somewhere else.

I have a data.table with 1e6 rows having this structure:

        V1       V2     V3
1: 03/09/2011 08:05:40 1145.0
2: 03/09/2011 08:06:01 1207.3
3: 03/09/2011 08:06:17 1198.8
4: 03/09/2011 08:06:20 1158.4
5: 03/09/2011 08:06:40 1112.2
6: 03/09/2011 08:06:59 1199.3

I am converting the V1 and V2 variables into a single datetime variable, using this code:

 system.time(DT[,`:=`(index= as.POSIXct(paste(V1,V2),
                         format='%d/%m/%Y %H:%M:%S'),
                     V1=NULL,V2=NULL)])

   user  system elapsed 
   47.47    0.16   50.27 

Is there any method to improve performance of this transformation?

Here is the output of `dput(head(DT))`:

DT <- structure(list(V1 = c("03/09/2011", "03/09/2011", "03/09/2011", 
"03/09/2011", "03/09/2011", "03/09/2011"), V2 = c("08:05:40", 
"08:06:01", "08:06:17", "08:06:20", "08:06:40", "08:06:59"), 
    V3 = c(1145, 1207.3, 1198.8, 1158.4, 1112.2, 1199.3)), .Names = c("V1", 
"V2", "V3"), class = c("data.table", "data.frame"), row.names = c(NA, 
-6L))
asked by agstudy (edited by Matt Dowle)

  • If your dates are GMT and after Jan 1, 1970, you can try the `fasttime` package. – Alex Popov Dec 10 '13 at 00:01
  • @aseidlitz Thanks. I forgot about `fasttime`. I think it is a good candidate. But isn't it used internally by the data.table package? – agstudy Dec 10 '13 at 00:06
  • There were some discussions about adding fasttime support to data.table but as far as I can see it's still an open request on [R Forge](https://r-forge.r-project.org/tracker/?func=detail&atid=978&aid=2582&group_id=240) – Alex Popov Dec 10 '13 at 00:13

2 Answers


This approach, which appears to run ~40x faster than the OP's, uses lookup tables and takes advantage of extremely fast data.table joins. It also exploits the fact that, while there may be 1e6 combinations of date and time, there can be at most 86400 unique times in a day, and probably even fewer unique dates. Finally, it avoids the use of `paste(...)` altogether.

library(data.table)

# create a dataset with 1MM rows
set.seed(1)
x  <- 1000*sample(1:1e5,1e6,replace=T)
dt <- data.table(id=1:1e6,
                 V1=format(as.POSIXct(x,origin="2011-01-01"),"%d/%m/%Y"),
                 V2=format(as.POSIXct(x,origin="2011-01-01"),"%H:%M:%S"),
                 V3=x)
DT <- copy(dt)  # copy() takes a real copy; plain <- would only alias the data.table

index.date <- function(dt) {
  # Edit: this change processes only times from the dataset; slightly more efficient
  V2 <- unique(dt$V2)
  dt.time <- data.table(char.time=V2,
                        int.time=as.integer(substr(V2,7,8))+
                          60*(as.integer(substr(V2,4,5))+
                                60*as.integer(substr(V2,1,2))))
  setkey(dt.time,char.time)
  # all dates from dataset
  dt.date <- data.table(char.date=unique(dt$V1), int.date=as.integer(as.POSIXct(unique(dt$V1),format="%d/%m/%Y")))
  setkey(dt.date,char.date)
  # join the dates
  setkey(dt,V1)
  dt <- dt[dt.date]
  # join the times
  setkey(dt,V2)
  dt <- dt[dt.time, nomatch=0]
  # numerical index
  dt[,int.index:=int.date+int.time]
  # POSIX date index
  dt[,index:=as.POSIXct(int.index,origin='1970-01-01')]
  # get back original order
  setkey(dt,id)
  return(dt)
}
# new approach
system.time(dt<-index.date(dt))
#   user  system elapsed 
#   2.26    0.00    2.26 


# original approach
system.time(DT[,`:=`(index= as.POSIXct(paste(V1,V2),
                                       format='%d/%m/%Y %H:%M:%S'),
                     V1=NULL,V2=NULL)])
#   user  system elapsed 
#  84.33    0.06   84.52 

Note that performance does depend on how many unique dates there are. In the test case there were ~1200 unique dates.
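
If you want to gauge the likely benefit on your own data, the cardinality of the lookup tables is cheap to check first (a minimal sketch, assuming the simulated `dt` built above):

# the fewer unique dates/times, the bigger the win from the lookup tables
dt[, list(unique.dates = length(unique(V1)),
          unique.times = length(unique(V2)))]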

EDIT: a proposal to rewrite the function in more idiomatic data.table syntax, avoiding `$` for subsetting:

index.date <- function(dt, fmt = "%d/%m/%Y") {
    # all times from the dataset, as seconds past midnight
    dt.time <- data.table(char.time = dt[, unique(V2)], key = 'char.time')
    dt.time[, int.time := as.integer(substr(char.time, 7, 8)) +
                          60 * (as.integer(substr(char.time, 4, 5)) +
                                60 * as.integer(substr(char.time, 1, 2)))]
    # all dates from the dataset, as seconds since the epoch
    dt.date <- data.table(char.date = dt[, unique(V1)], key = 'char.date')
    dt.date[, int.date := as.integer(as.POSIXct(char.date, format = fmt))]
    # join the dates
    setkey(dt, V1)
    dt <- dt[dt.date]
    # join the times
    setkey(dt, V2)
    dt <- dt[dt.time, nomatch = 0]
    # numerical index
    dt[, int.index := int.date + int.time]
    # POSIX date index
    dt[, index := as.POSIXct(int.index, origin = '1970-01-01')]
    # remove extra/temporary variables
    dt[, `:=`(int.index = NULL, int.date = NULL, int.time = NULL)]
    return(dt)
}
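
A quick usage sketch of this version (assuming a fresh `dt` built as in the simulation above; note the result comes back sorted by the join keys, not in the original row order):

dt <- index.date(dt)  # the joins build a new table, so assign the result back
head(dt)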
answered by jlhoward (edited by agstudy)
  • Minor change to process only unique times from the dataset; slightly more efficient (~2.2s vs 3.9s before). – jlhoward Dec 11 '13 at 02:12
  • Thanks, very good idea to use data.table joins. I slightly modified your solution to put it in more data.table-sugar syntax. – agstudy Dec 11 '13 at 08:21
  • @agstudy: Much cleaner, thanks. But why does the last statement recover the original order? – jlhoward Dec 11 '13 at 08:28
  • It is a typo (I left the old comment); I changed it. I suggest we consolidate the answers into a short one. What do you think? – agstudy Dec 11 '13 at 08:31
  • When you use `:=` it does assignment by reference. You don't have to assign it back (using `<-`) ex: `dt.date <- dt.date[,int.date:=as.integer(as.POSIXct(char.date,format=fmt))]` could be: `dt.date[,int.date:=as.integer(as.POSIXct(char.date,format=fmt))]` – Arun Dec 11 '13 at 08:41
  • Does this take leap seconds into account? – Joshua Ulrich Jan 02 '16 at 16:00

If there are many time stamps that will be repeated in your data, you can try adding `by = list(V1, V2)`, but there would have to be enough repetition to pay for the cost of splitting.
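
A minimal sketch of that grouping idea, assuming the `DT` from the question (the conversion then runs once per unique date/time pair, and the scalar result is recycled across the rows of each group):

# one as.POSIXct() call per unique (V1, V2) pair instead of per row
DT[, index := as.POSIXct(paste(V1[1], V2[1]),
                         format = '%d/%m/%Y %H:%M:%S'),
   by = list(V1, V2)]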

The bottleneck here is the paste and the conversion, which leads me to think that the answer is no (unless you use an alternate method of converting to POSIXct).
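
To see where the time actually goes, the two steps can be timed separately (a rough sketch, again assuming the `DT` from the question):

# split the cost: string concatenation vs. datetime parsing
system.time(s <- DT[, paste(V1, V2)])                            # paste step
system.time(idx <- as.POSIXct(s, format = '%d/%m/%Y %H:%M:%S'))  # conversion step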

answered by Ricardo Saporta
  • Thanks. I forgot to mention that I should not have duplicated dates (it is basically time series data). – agstudy Dec 10 '13 at 00:05
  • And further, the bottleneck is really the conversion, which takes something like 40 times longer than the paste step. If your dates were in true POSIX format (i.e. "1970-01-01" instead of "01/01/1970"), you could speed things up hugely using Simon Urbanek's fasttime library, [as noted here](http://stackoverflow.com/questions/12898318/convert-character-to-date-quickly-in-r/12898544#12898544). – Josh O'Brien Dec 10 '13 at 00:12
  • @JoshO'Brien you're right about fasttime. For example, using something like `DT[,fastPOSIXct(paste(as.Date(V1,format="%d/%m/%Y"),V2))]` improves performance by at least 4 times. – agstudy Dec 11 '13 at 08:28
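
For reference, a minimal sketch of that `fasttime` route (assuming, per the comments above, GMT timestamps after 1970-01-01; `fastPOSIXct` parses only ISO-ordered strings, hence the `as.Date` step to reorder the date part):

library(fasttime)

# reorder "dd/mm/yyyy" into ISO "yyyy-mm-dd" before parsing;
# fastPOSIXct assumes the input is GMT
DT[, index := fastPOSIXct(paste(as.Date(V1, format = "%d/%m/%Y"), V2),
                          tz = "GMT")]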