5

My question will be explained throughout the following reproducible example.

First, let's load the required packages and create a POSIXct and a data.table object.

library(data.table)
library(lubridate)

target_date <- ymd(20180601, tz='America/Montreal')

test <- data.table(
  V1 = seq(1:3),
  V2 = c(ymd(20170421, tz='America/Montreal'),
         ymd(20170702, tz='America/Montreal'),
         ymd(20180113, tz='America/Montreal'))
)

As we can see below, everything worked fine.

test[]

##     V1         V2
##  1:  1 2017-04-21
##  2:  2 2017-07-02
##  3:  3 2018-01-13

Then, I create a new column which contains Interval objects.

test[, V3:=interval(V2, target_date)]

Still there, everything is alright.

test[]

##     V1         V2                             V3
##  1:  1 2017-04-21 2017-04-21 EDT--2018-06-01 EDT
##  2:  2 2017-07-02 2017-07-02 EDT--2018-06-01 EDT
##  3:  3 2018-01-13 2018-01-13 EST--2018-06-01 EDT

Then, I want to create two new columns which will only be defined on the first two rows: the first one is a function of the POSIXct column and the second one is a function of the Interval column.

test[c(1, 3), V4 := V2 + days(20)]
test[]

##     V1         V2                             V3         V4
##  1:  1 2017-04-21 2017-04-21 EDT--2018-06-01 EDT 2017-05-11
##  2:  2 2017-07-02 2017-07-02 EDT--2018-06-01 EDT       <NA>
##  3:  3 2018-01-13 2018-01-13 EST--2018-06-01 EDT 2018-02-02

Still correct at this point.

Then, when I try the same thing with a column function of the Interval column, I get the following warning

test[c(1, 3), V5 := ymd(20180101, tz='America/Montreal') %within% V3]

##  Warning messages:
##  1: In as.numeric(a) - as.numeric(b@start) <= b@.Data :
##    longer object length is not a multiple of shorter object length
##  2: In `[.data.table`(test, c(1, 3), `:=`(V5, ymd(20180101, tz = "America/Montreal") %within%  :
##  Supplied 3 items to be assigned to 2 items of column 'V5' (1 unused)

This warning tells me that, in fact, the command produced 3 values and that I try to input those in a 2 rows data.table (the filtered data.table has two rows).

I tried to figure out what the problem was and I think I found a hint. Consider the two following commands:

test[, V3][2]
##  [1] 2017-07-02 EDT--2018-06-01 EDT

test[2, V3]
##  [1] 2017-04-21 EDT--2018-03-21 EDT 2017-07-02 EDT--2018-06-01 EDT 2018-01-13 EST--2018-12-13 EST

In fact, I thought those two commands would have produce the same result, which is not the case. Even more surprising, they both have a length of 1 (when looking at the output above, I thought the second command would have a length of 3).

length(test[, V3][2])
##  [1] 1

length(test[2, V3])
##  [1] 1

The thing is, these two commands produce an object with a vector of 1 in the .Data slot (the length of the Interval in seconds I guess),

test[, V3][2]@.Data
##  [1] 28857600

test[2, V3]@.Data
##  [1] 28857600

but the first has a vector of 1 in the start slot while the second has a vector of 3 in the start slot.

test[, V3][2]@start
##  [1] "2017-07-02 EDT"

test[2, V3]@start
##  [1] "2017-04-21 EDT" "2017-07-02 EDT" "2018-01-13 EST"

I know I could probably fix this problem by encapsulating each Interval objects in a list and extract them every time I need, but is there another way to deal with this?

J.P. Le Cavalier
  • 1,315
  • 7
  • 16

1 Answers1

2

Might be a bit filthy, but could you just go:

test[c(1, 3), V5 := ymd(20180101, tz='America/Montreal') %within% V3[.I]]

test[, dat := (V3[.I]@.Data)]
Matt
  • 518
  • 2
  • 5
  • 19
  • 1
    If I do this, I'm going to have the good @start for each entry, but the first @.Data for both entries if I understand well the package behavior...! – J.P. Le Cavalier Feb 01 '18 at 21:40
  • Seems to be close to working for me (see edit). I would probably abandon lubridate if you have a lot of data. – Matt Feb 01 '18 at 21:56
  • 1
    I was thinking too of abandoning `lubridate`, my database has 500M rows, the little more precision I get using `lubridate` doesn't worth it against the computing time I lose... But' I'm still curious of fixing this. – J.P. Le Cavalier Feb 01 '18 at 22:03
  • 1
    I was getting terrible performance with `lubridate` intervals on a table with 350k rows, I would definitely avoid it if you can and just treat the dates like integers. I think that the solution I posted maybe gets the correct `@.Data` for each row, but then maybe not. – Matt Feb 01 '18 at 22:08
  • Thanks for the hint @Matt, I'll use the `data.table` way of treating dates. – J.P. Le Cavalier Feb 01 '18 at 22:16