
I recently found strange behaviour in data.table's assignment operator := when assigning a lubridate Period object to a column: it assigns only the very first period to all cells. Here is an MRE:

library(data.table)
library(lubridate)

data.table(x = 1:5)[x == 3, p := period(7, "day")
  ][x == 4, p := period(1, "month")][]

#    x           p
# 1: 1        <NA>
# 2: 2        <NA>
# 3: 3 7d 0H 0M 0S
# 4: 4 7d 0H 0M 0S
# 5: 5        <NA>

My packages are from CRAN; data.table is version 1.11.2 and lubridate is version 1.7.4.

Does anyone know what's happening here and how to make it work properly?

inscaven

1 Answer


Problem

Short version:

data.table does not handle complex data types, such as S4 objects with per-element slots, well.

Long version:

When assigning to or subsetting the period column, data.table seems to touch only .Data, which holds the seconds. As far as I understand it, a Period object is an S4 object: a double vector of seconds (.Data) plus per-element slots for year, month, day, hour and minute.

data.table handles the actual values (.Data, i.e. the seconds) well, but it treats the other slots as attributes that apply to the entire column, so they are not subsetted or assigned row by row.
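To see the structure described above outside of data.table, here is a minimal sketch that inspects the slots of a Period vector directly. The variable name p is only for illustration.

```r
library(lubridate)

# Two different periods combined into one Period vector
p <- c(period(7, "day"), period(1, "month"))

isS4(p)    # Period is a formal (S4) class
p@.Data    # the underlying double vector: seconds only
p@day      # per-element day slot
p@month    # per-element month slot
```

Both elements have 0 seconds in .Data, so anything that copies only .Data cannot tell the two periods apart; the difference lives entirely in the day and month slots.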

Some illustrating examples:

# Only .Data gets subsetted, no other slots
DT <- data.table(x = 1:3)
DT$p <- rep(period(7, "days"), 3)
str(DT[1,])

# Classes ‘data.table’ and 'data.frame':    1 obs. of  2 variables:
#  $ x: int 1
#  $ p:Formal class 'Period' [package "lubridate"] with 6 slots
#   .. ..@ .Data : num 0
#   .. ..@ year  : num  0 0 0
#   .. ..@ month : num  0 0 0
#   .. ..@ day   : num  7 7 7
#   .. ..@ hour  : num  0 0 0
#   .. ..@ minute: num  0 0 0
#  - attr(*, ".internal.selfref")=<externalptr>

# Assignment only changes .Data, no other slots
DT[1, p := 40]
DT

#    x            p
# 1: 1 7d 0H 0M 40S
# 2: 2  7d 0H 0M 0S
# 3: 3  7d 0H 0M 0S

# Subsetting a single row still returns three values, because the
# year slot keeps all three entries. Notice that the seconds are
# correct for row 1; only .Data was subsetted.
DT <- data.table(x = 1:3)
DT$p <- c(period("1Y1S"), period("2Y2S"), period("3Y3S"))
DT[1,]$p
# [1] "1y 0m 0d 0H 0M 1S" "2y 0m 0d 0H 0M 1S" "3y 0m 0d 0H 0M 1S"

I do not know the internals of why data.table behaves like this, or whether it will ever fully support Period and other complex data types. I would suggest keeping an eye on https://github.com/Rdatatable/data.table/ if you want an answer to this.

Related issues:

data.table can not handle multiple time zone attributes for a POSIXct column.

Adding timezone to POSIXct object in data.table

https://github.com/Rdatatable/data.table/issues/4974

https://github.com/Rdatatable/data.table/issues/4415

Possible workarounds

  1. As a general solution, you can always use a list column for complex data types. It is sometimes a bit harder to reason about, but it always works. I would recommend this if you do not expect to filter on that column.
DT <- data.table(x = 1:5)[x == 3, p := list(list(period(7, "day")))]
DT[x == 4, p := period(1, "month")]
DT[]
#   x              p
# 1: 1               
# 2: 2               
# 3: 3    7d 0H 0M 0S
# 4: 4 1m 0d 0H 0M 0S
# 5: 5       

DT[p > period(1, "month"),]
# Error: 'list' object cannot be coerced to type 'double'

largerThanMonth <- function(x){
  if(is.null(x)){
    FALSE
  } else{
    x >=  period(1, "month")
  }
}

DT[sapply(p, largerThanMonth),]
#    x              p
# 1: 4 1m 0d 0H 0M 0S
  2. Native data.frames seem to work properly.
DT <- data.frame(x = 1:3)
DT$p <- c(period("1Y1S"), period("2Y2S"), period("3Y3S"))
DT[1,]$p
# [1] "1y 0m 0d 0H 0M 1S"

DT[1,]$p <- period("1M")
DT
#   x                 p
# 1 1             1M 0S
# 2 2 2y 0m 0d 0H 0M 2S
# 3 3 3y 0m 0d 0H 0M 3S

  3. Convert the period to numeric or character.
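
A minimal sketch of that last workaround, assuming it is acceptable to store the period as text or as a number of seconds. The column names p_chr and p_sec and the variable long_rows are made up for illustration. Note that period_to_seconds() converts months and years using average lengths, so the numeric form is only approximate for those units.

```r
library(data.table)
library(lubridate)

DT <- data.table(x = 1:5)

# Character column: keeps the exact units as text
DT[x == 3, p_chr := format(period(7, "day"))]
DT[x == 4, p_chr := format(period(1, "month"))]

# Numeric column: seconds, convenient for filtering.
# Months and years use average lengths, so this is approximate.
DT[x == 3, p_sec := period_to_seconds(period(7, "day"))]
DT[x == 4, p_sec := period_to_seconds(period(1, "month"))]

# Rows whose period is at least one (average) month long
long_rows <- DT[p_sec >= period_to_seconds(period(1, "month"))]
long_rows
```

Because both columns are plain atomic vectors, data.table subsets and assigns them row by row as usual, and filtering works without the sapply() helper needed for the list column.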
jsch