Efficient comparison of POSIXct in data.table

Question

Hello I am looking for an efficient way of selecting POSIXct rows from a data.table such that the time of day is less than say 12:00:00 (NOTE that millisecond is NOT required, so we can use ITime for example)

set.seed(1); N = 1e7;
DT = data.table(dts = .POSIXct(1e5*rnorm(N), tz="GMT"))
DT
                               dts
#       1: 1969-12-31 06:35:54.618925
#       2: 1970-01-01 05:06:04.332422
#     ---                           
# 9999999: 1970-01-03 00:37:00.035565
#10000000: 1969-12-30 08:30:23.624506

One solution (the problem here is that the cast could be costly if N is big)

f <- function(t, st, et) {time <- as.ITime(t); return(time>=as.ITime(st) & time<=as.ITime(et))}
P <- function(t, s) { #geekTrader solution
    ep <- .parseISO8601(s) 
    if(grepl('T[0-9]{2}:[0-9]{2}:[0-9]{2}/T[0-9]{2}:[0-9]{2}:[0-9]{2}', s)){
        first.time <- as.double(ep$first.time)
        last.time <- as.double(ep$last.time)-31449600
        SecOfDay <- as.double(t) %% 86400
        return(SecOfDay >= first.time & SecOfDay <= last.time )
    } else {
        return(t >= ep$first.time & t <= ep$last.time)    
    }
}

Quick look about the perf

system.time(resf <- DT[f(dts,'00:00:00','11:59:59')])
   user  system elapsed 
   1.01    0.28    1.29
system.time(resP <- DT[P(dts,'T00:00:00/T11:59:59')])
   user  system elapsed 
   0.64    0.13    0.76 

identical(resf,resP)
[1] TRUE

Are you happy enough to create an `itime` column and key by it? — mnel, Apr 08 '13 at 23:14
You really shouldn't edit people's answers into your question — GSee, Apr 09 '13 at 13:50
I see... why is that ? looks better for understanding to me... — statquant, Apr 09 '13 at 13:58
@statquant : just curious why was bounty awarded to the Richie's answer? — CHP, Apr 16 '13 at 01:10
@geektrader : I am curious too I think it is supposed to be 50/50 as both of you got >4 up, that's why I did not award it myself. I've let the admin know. Sorry about that. — statquant, Apr 16 '13 at 06:54
@statquant : ok. I was quite surprised when I saw bounty being awarded to `as.POSIXlt` answer :) — CHP, Apr 16 '13 at 08:17
@geektrader: It seems that ONLY the oldest answer with a more than 2 points gets the bounty! I did not know the rule, as the whole bounty is always taken off the OP I thought the whole would be awawred or at least granted pro-rata... :( — statquant, Apr 16 '13 at 15:38

CHP · Accepted Answer · 2013-04-09T11:26:03.837

 P <- function(t, s) {
  ep <- .parseISO8601(s)

  if(grepl('T[0-9]{2}:[0-9]{2}:[0-9]{2}/T[0-9]{2}:[0-9]{2}:[0-9]{2}', s)){
    first.time <- as.double(ep$first.time)
    last.time <- as.double(ep$last.time)-31449600
    SecOfDay <- as.double(t) %% 86400
    return(SecOfDay >= first.time & SecOfDay <= last.time )

  } else {
    return(t >= ep$first.time & t <= ep$last.time)    
  }

}

F <- function(t, st, et) {
  time <- as.ITime(t) 
  return(time>=as.ITime(st) & time<=as.ITime(et))
}


 Sys.setenv(TZ='GMT')
 N = 1e7;
 set.seed(1);

 DT <- data.table(dts = .POSIXct(1e5*rnorm(N), tz="GMT"))


 system.time(resP <- DT[P(dts, 'T00:00:00/T12:00:00'), ])
##   user  system elapsed 
##   1.11    0.11    1.22 
 system.time(resF <- DT[F(dts,'00:00:00','12:00:00')])
##   user  system elapsed 
##   2.22    0.29    2.51 

 resP
##                         dts
##      1: 1969-12-31 06:35:54
##      2: 1970-01-01 05:06:04
##      3: 1969-12-31 00:47:17
##      4: 1970-01-01 09:09:10
##      5: 1969-12-31 01:12:33
##     ---                    
##5000672: 1970-01-01 06:08:15
##5000673: 1970-01-01 05:02:27
##5000674: 1969-12-31 02:25:24
##5000675: 1970-01-03 00:37:00
##5000676: 1969-12-30 08:30:23
 resF
##                         dts
##      1: 1969-12-31 06:35:54
##      2: 1970-01-01 05:06:04
##      3: 1969-12-31 00:47:17
##      4: 1970-01-01 09:09:10
##      5: 1969-12-31 01:12:33
##     ---                    
##5000672: 1970-01-01 06:08:15
##5000673: 1970-01-01 05:02:27
##5000674: 1969-12-31 02:25:24
##5000675: 1970-01-03 00:37:00
##5000676: 1969-12-30 08:30:23

 #Check the correctness
 resP[,list(mindts=max(dts)),by=list(as.Date(dts))]
##       as.Date              mindts
## 1: 1969-12-31 1969-12-31 12:00:00
## 2: 1970-01-01 1970-01-01 12:00:00
## 3: 1969-12-29 1969-12-29 12:00:00
## 4: 1970-01-02 1970-01-02 12:00:00
## 5: 1969-12-30 1969-12-30 12:00:00
## 6: 1970-01-03 1970-01-03 12:00:00
## 7: 1970-01-04 1970-01-04 11:59:59
## 8: 1970-01-05 1970-01-05 11:59:45
## 9: 1969-12-28 1969-12-28 12:00:00
##10: 1969-12-27 1969-12-27 11:59:21
##11: 1970-01-06 1970-01-06 10:53:21
##12: 1969-12-26 1969-12-26 10:15:03
##13: 1970-01-07 1970-01-07 08:21:55
 resF[,list(mindts=max(dts)),by=list(as.Date(dts))]
##       as.Date              mindts
## 1: 1969-12-31 1969-12-31 12:00:00
## 2: 1970-01-01 1970-01-01 12:00:00
## 3: 1969-12-29 1969-12-29 12:00:00
## 4: 1970-01-02 1970-01-02 12:00:00
## 5: 1969-12-30 1969-12-30 12:00:00
## 6: 1970-01-03 1970-01-03 12:00:00
## 7: 1970-01-04 1970-01-04 11:59:59
## 8: 1970-01-05 1970-01-05 11:59:45
## 9: 1969-12-28 1969-12-28 12:00:00
##10: 1969-12-27 1969-12-27 11:59:21
##11: 1970-01-06 1970-01-06 10:53:21
##12: 1969-12-26 1969-12-26 10:15:03
##13: 1970-01-07 1970-01-07 08:21:55

Now some demo of nice xts style subsetting

 DT[P(dts, '1970')]
##                         dts
##      1: 1970-01-01 05:06:04
##      2: 1970-01-02 20:18:48
##      3: 1970-01-01 09:09:10
##      4: 1970-01-01 13:32:22
##      5: 1970-01-01 20:30:32
##     ---                    
##5001741: 1970-01-02 15:51:12
##5001742: 1970-01-03 01:41:31
##5001743: 1970-01-01 06:08:15
##5001744: 1970-01-01 05:02:27
##5001745: 1970-01-03 00:37:00
 DT[P(dts, '197001')]
##                         dts
##      1: 1970-01-01 05:06:04
##      2: 1970-01-02 20:18:48
##      3: 1970-01-01 09:09:10
##      4: 1970-01-01 13:32:22
##      5: 1970-01-01 20:30:32
##     ---                    
##5001741: 1970-01-02 15:51:12
##5001742: 1970-01-03 01:41:31
##5001743: 1970-01-01 06:08:15
##5001744: 1970-01-01 05:02:27
##5001745: 1970-01-03 00:37:00
 DT[P(dts, '19700102')]
##                         dts
##      1: 1970-01-02 20:18:48
##      2: 1970-01-02 17:59:38
##      3: 1970-01-02 07:14:53
##      4: 1970-01-02 02:13:03
##      5: 1970-01-02 01:31:37
##     ---                    
##1519426: 1970-01-02 11:25:24
##1519427: 1970-01-02 10:00:21
##1519428: 1970-01-02 05:21:25
##1519429: 1970-01-02 05:11:26
##1519430: 1970-01-02 15:51:12
 DT[P(dts, '19700102 00:00:00/19700103 12:00:00')]
##                         dts
##      1: 1970-01-02 20:18:48
##      2: 1970-01-02 17:59:38
##      3: 1970-01-02 07:14:53
##      4: 1970-01-02 02:13:03
##      5: 1970-01-02 01:31:37
##     ---                    
##1785762: 1970-01-02 05:21:25
##1785763: 1970-01-02 05:11:26
##1785764: 1970-01-02 15:51:12
##1785765: 1970-01-03 01:41:31
##1785766: 1970-01-03 00:37:00

 #Check the correctness again
 DT[P(dts, '19700102 00:00:00/19700103 12:00:00'), max(dts)]
##[1] "1970-01-03 12:00:00 GMT"
 DT[P(dts, '19700102 00:00:00/19700103 12:00:00'), min(dts)]
##[1] "1970-01-02 00:00:00 GMT"

Rightly or wrongly, I tend to use HHMMSS stored as plain integer; i.e. no epoch, no class. — Matt Dowle, Apr 07 '13 at 10:08
Nice. Storing time as seconds and `%% 86400` seems like a nice straightforward solution, and mostly likely pretty fast. — Aaron left Stack Overflow, Apr 09 '13 at 04:45
@statquant Seem to be working ok after we replace `as.integer` by `as.double` — CHP, Apr 09 '13 at 11:26
Well done, identical. But would that work on a leap year ? cause there is more than 86400 seconds those years... — statquant, Apr 09 '13 at 13:54
It will work on leap year too. POSIXct ignores leap seconds. Read that somewhere in documentation — CHP, Apr 09 '13 at 14:10

score 4 · Answer 2 · answered Apr 11 '13 at 08:04

4

The canonical way of doing this is to convert to POSIXlt and extract the hour component.

hour(as.POSIXlt(DT$dts, "GMT")) < 12

This seems to be comparable in performance to the other techniques discussed (and is easier to understand).

answered Apr 11 '13 at 08:04

Richie Cotton

118,240
47
247
360

+1 Surely this is just about the fastest way? If not, it is certainly the most comprehensible to me. It reduced 1e7 rows to ~ 5e6 rows (what I would expect given `rnorm`) in ~ 1 second. And requires no more than base R. – Simon O'Hanlon Apr 11 '13 at 12:51
`data.table::hour` is already using `as.POSIXlt`, so you can write something like `DT[, morning:= (hour(DT$dts) < 12)]`. – Oscar Perpiñán Apr 12 '13 at 14:07
Have you guys tried benchmarking? My initial benchmarking revealed `as.POSIXlt` didn't really give as much performance boost.. – CHP Apr 12 '13 at 16:12

GSee · Answer 3 · 2013-04-07T22:31:22.700

Here's a way that uses some of the functionality from xts to accomplish what you want. This is not a great solution because xts objects must be ordered by time, but data.table objects do not have to be. Also, it may not be terribly fast since there is some redundant work being done by xts and data.table. Nonetheless, I thought it might be interesting.

library(data.table)
library(xts)
set.seed(1); N = 1e5;
# I tweaked the following line to make this reproducible in other timezones.
DT = data.table(dts = .POSIXct(1e5*rnorm(N), tz="GMT"))
setkey(DT, dts)  # must sort on time first so that the `xts` object we're about 
                 # to create has the same order
DT[, XTS:=xts(rep(NA, .N), dts)]  # add a dummy xts object as a column
DT[XTS["T00:00:00/T11:59:59.999999", which=TRUE]][, list(dts)] 
                       dts
    1: 1969-12-27 00:28:41
    2: 1969-12-27 00:34:00
    3: 1969-12-27 03:11:21
    4: 1969-12-27 04:20:27
    5: 1969-12-28 00:00:21
   ---                    
49825: 1970-01-05 08:05:22
49826: 1970-01-05 09:35:32
49827: 1970-01-05 09:49:49
49828: 1970-01-05 09:50:27
49829: 1970-01-05 11:07:32

The above uses an xts-style subsetting string to get the rows where the time is between 00:00:00 and 12:00:00 for every day. Using which=TRUE returns the row number instead of the data from that row, so that we can subset the data.table by those rows.

You could use a string like "1970-01-01" to get all data from that day, or "1970-01" to get all data from January 1970, or "1970-01-01/1970-01-02" to get all rows from those two days.

Yes, as you point out, the ordering (`setkey` operation) seems to be expensive on bigger tables (1e7). And for some reason, the test result with 1e7 rows (seed = 1) gives about 60 rows more in your solution (maybe it's got to do with the milliseconds?) — Arun, Apr 07 '13 at 22:07
@Arun "T00:00:00/T12:00:00" will also include times up to but not including 12:00:01. I have edited to use a more precise subset string that gives the same number of rows as the OP. — GSee, Apr 07 '13 at 22:31
Thanks Gsee, though I find this x2 time less efficient than based method (function `f`), and that's without the sorting — statquant, Apr 09 '13 at 10:11

score 0 · Answer 4 · answered Apr 16 '13 at 02:27

A late entry, but I think that the as.POSIXlt solution will create a named list of vectors, of which you only want the hour

I would key by an ITime column and then use a binary search to subset those times before 12pm

There are 60*60 *12 - 1 seconds before 12pm so seq_len(43199) will return everything up to (but not including) 12pm

# create IDate and ITime columns and key by time
setkey(DT[, c('Date','Time') := IDateTime(dts)],Time)

# subset times before 12pm
DT[.(seq_len(43199))]

That's very good actually, let me test the leap years and summer time stuff, it is always more tricky than it seems. Thanks — statquant, Apr 16 '13 at 06:56

Efficient comparison of POSIXct in data.table

4 Answers4