5

I would like all the data between certain times each day using data.table.

Is this the most efficient (speed wise and memory) way to do these kinds of subsetting?

R.data.table <- data.table(Time = Sys.time() + 1:86400, runif(86400))

R.data.table[Time > as.POSIXct('2016-09-18 08:00:00') & Time < as.POSIXct('2016-09-18 09:00:00')]

I know I can use xts but I like working with data.table because i might use these subsetted data sets for prediction models so I dont need to convert.

I have looked at data.table help on IDate and ITime but I don't really know how to put it all together. Are they faster and easy to work with interactively?

For operations like, these are examples I'm not asking for how to do these directly ..., give me all the data for the last 2 business days of each month, all business day hours. Is doing it like I do above the most efficent way to do it or are there better ways to manipulate time series with data tables in R?

Uwe
  • 41,420
  • 11
  • 90
  • 134
  • I don't see any possible improvement with your method. Your Time is simple number, the selection expression is vectorized. It should be fine unless you met some bottleneck. – dracodoc Sep 18 '16 at 16:04
  • For the operation examples you mentioned, using conditions in data.table `i` should be enough. One possible method is to extract certain part of the date/time like hour into columns, so you don't need extract them again and again in queries. – dracodoc Sep 18 '16 at 16:07
  • thanks for the feedback. You should write it as an answer? –  Sep 18 '16 at 16:48
  • We can wait and see if other people have more suggestions. – dracodoc Sep 18 '16 at 18:10
  • See ?between. It has been parallelised in the current devel version for this case. See my talk at budapest on the project wiki page. – Arun Oct 10 '16 at 19:58
  • @Arun thanks for this feedback! Will do –  Oct 10 '16 at 22:03
  • @Arun Is the version on CRAN currently parallelised? I don't notice much speed improvement using between yet. Maybe my use cases aren't good though. any answers still welcome too –  May 18 '17 at 02:02
  • It seems like parallel version isn't being run here because it's of class `POSIXct`. The function `is_strictly_numeric` inside `between` needs to check for `mode(x) == "numeric"` instead of `is.numeric(x)` perhaps.. Might be better to file an issue. – Arun May 22 '17 at 21:22
  • @Arun it is even worse, it fallback to character type processing which is plain R, filled https://github.com/Rdatatable/data.table/issues/3519 – jangorecki Apr 22 '19 at 18:24
  • @jangorecki it's comparing numerics, just using base R's vector-scan approach. Why do you think it's doing it on char type? – Arun Apr 23 '19 at 02:27
  • @Arun because of the comment in the code that describe the branch to be used for character type – jangorecki Apr 23 '19 at 08:38

1 Answers1

2

Is doing it like I do above the most efficent way to do it or are there better ways to manipulate time series with data tables in R?

The most efficient way for these kinds of subsetting (range subsetting) is to use between function. Unfortunately it currently suffers from a bug, thus it is no faster than the approach you are using. The bug has been fixed, once merged devel package will be published in our CRAN-like repo (including binaries). Another reason for using between is that it is more likely it will be internally optimised in future giving speed/memory improvement, as there is still space for improvement. There is a third way to get expected answer, using non-equi join, but it will be slowest from all three.

library(data.table)
d = data.table(Time = as.POSIXct("2016-09-18 06:00:00") + 1:86400, runif(86400))
dn = as.POSIXct('2016-09-18 08:00:00')
up = as.POSIXct('2016-09-18 09:00:00')
d[Time > dn & Time < up]
d[between(Time, dn, up, incbounds=FALSE)]
d[.(dn=dn, up=up), on=.(Time>dn, Time<up)]

I have looked at data.table help on IDate and ITime but I don't really know how to put it all together. Are they faster and easy to work with interactively?

They can be faster, and are precise. The I prefix stands for Integer. The reason why they were introduced was that POSIXct is a numeric, so suffers from floating point arithmetic problems. Joining or grouping of floating point might result in different answers on different platforms. Integer type is much more portable and can be optimised for operations like sorting, or grouping.


There is a pending feature request for more precise datetime data type: Faster internal date/datetime implementation (with ns resolution..) https://github.com/Rdatatable/data.table/issues/1451


Also there is a roadmap for new vignettes: timeseries - ordered observations https://github.com/Rdatatable/data.table/issues/3453, you might want to consult that issue for more features that data.table offers for ordered datasets, obviously it is just a tiny percent of what xts offers, but usually is highly optimised.

jangorecki
  • 16,384
  • 4
  • 79
  • 160