Edit (2019-06): This problem does not exist anymore, as this issue has been closed and a related feature implemented. If you now run the code with updated packages, it will work.

I'm trying to find overlapping intervals, so I decided to join the interval data on itself with dplyr::left_join(). That way I could use lubridate::int_overlaps() to compare each interval to every other interval with the same id.

Here's how I expect left_join() to behave. The two tibbles with three rows each cross to form a tibble with 9 rows:

library(tidyverse)
library(lubridate)  # provides make_date() and %--% for the interval example

tibble(a = rep("a", 3), b = rep(1, 3)) %>% 
  left_join(tibble(a = rep("a", 3), c = rep(2, 3)))
Joining, by = "a"
# A tibble: 9 x 3
      a     b     c
  <chr> <dbl> <dbl>
1     a     1     2
2     a     1     2
3     a     1     2
4     a     1     2
5     a     1     2
6     a     1     2
7     a     1     2
8     a     1     2
9     a     1     2

And here's how the same code behaves with intervals. I get nine rows but the rows don't cross like they do above:

tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>% 
  left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
Joining, by = "a"
# A tibble: 9 x 3
      a                              b                              c
  <chr>                 <S4: Interval>                 <S4: Interval>
1     a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2     a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3     a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4     a                         NA--NA                         NA--NA
5     a                         NA--NA                         NA--NA
6     a                         NA--NA                         NA--NA
7     a                         NA--NA                         NA--NA
8     a                         NA--NA                         NA--NA
9     a                         NA--NA                         NA--NA

I think this is unexpected, but I might be missing something. Is it a bug?

I'm using lubridate 1.7.1, tibble 1.3.4 and dplyr 0.7.4.

pasipasi

3 Answers

The bug

The object still contains the relevant information:

res <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>% 
  left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003)))) 

print.data.frame(res)
# a                              b                              c
# 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC

res$c    
# [1] 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# [5] 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# [9] 2002-01-01 UTC--2003-01-01 UTC

But subsetting by index doesn't work anymore:

res_df <- as.data.frame(res)

head(res_df)
  a                              b                              c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a                         NA--NA                         NA--NA
5 a                         NA--NA                         NA--NA
6 a                         NA--NA                         NA--NA

res_df[4,"c"]
[1] NA--NA

tibble:::print.tbl relies on head(), which subsets by index. That is why the issue is immediately visible with tibbles and not with data.frames.

Running str(res$b) shows that the object holds only 3 start values for its 9 data values.
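A quick way to check for this mismatch without reading the full str() output is to compare the slot lengths directly. The helper below is only a sketch: `slots_consistent` is a hypothetical name, not a lubridate function, and it assumes the Interval class keeps its lengths in `.Data` and its start times in `start`, as the str() output suggests.

```r
library(lubridate)

# Hypothetical helper (not part of lubridate): TRUE when an Interval has
# exactly one start value per data value.
slots_consistent <- function(x) {
  stopifnot(is.interval(x))
  length(x@start) == length(x@.Data)
}

# A freshly built interval vector should be consistent.
iv <- rep(make_date(2001) %--% make_date(2002), 3)
slots_consistent(iv)
```

On the affected package versions, `slots_consistent(res$b)` would be expected to return FALSE for the join result above.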

If we repeat the start values so that there are 9 of them:

res_df$b@start <- rep(res_df$b@start,3)
res_df$c@start <- rep(res_df$c@start,3)

everything now prints fine:

  a                              b                              c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
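To see why a short start slot would produce NA--NA only from row 4 onward, here is a base-R sketch with a toy S4 class. `ToyInterval` is hypothetical and only mimics the layout described above (numeric data plus a separate `start` slot); it is not lubridate's actual class, and the join is imitated by repeating `.Data` by hand.

```r
library(methods)

# Hypothetical toy class: numeric lengths in .Data, start values in a slot.
setClass("ToyInterval", contains = "numeric", slots = c(start = "numeric"))

# A consistent object: 3 lengths and 3 starts.
ok <- new("ToyInterval", c(10, 10, 10), start = c(1, 1, 1))

# Imitate the broken join result: .Data grows to 9, start stays at 3.
broken <- ok
broken@.Data <- rep(ok@.Data, 3)

n_data  <- length(broken@.Data)   # number of data values
n_start <- length(broken@start)   # number of start values

# Indexing the short slot past its end gives NA, so any row beyond the
# third has no start value to print.
fourth_start <- broken@start[4]

# Same repair as above: repeat the start slot to match the data.
fixed <- broken
fixed@start <- rep(broken@start, 3)
```

Printing the whole object at once never indexes past the third start value in this toy, which would be consistent with print.data.frame() looking fine while row-wise subsetting does not.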

The Solution

We've seen that as.data.frame is not enough: left_join is the function messing things up. Use merge instead:

res <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>% 
  merge(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))),
        all.x=TRUE) 

head(res)
# a                              b                              c
# 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC

res[4,"c"]
#[1] 2002-01-01 UTC--2003-01-01 UTC
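If the end goal is only to flag overlapping intervals per id, another way to sidestep the problem is to keep the interval endpoints as plain Date columns and compare them after the join. This is a different technique from the merge() fix above, sketched under assumptions: the data, the `row` helper column, and the column names are made up for illustration, and the comparison treats touching endpoints as overlapping, so adjust it if you need different boundary behavior.

```r
# Hypothetical example data: start and end stored as Dates, no Interval column.
ivs <- data.frame(
  id    = c("a", "a", "a", "b"),
  row   = 1:4,
  start = as.Date(c("2001-01-01", "2001-06-01", "2003-01-01", "2001-01-01")),
  end   = as.Date(c("2002-01-01", "2002-06-01", "2004-01-01", "2002-01-01"))
)

# Self-join by id: every interval is paired with every interval of the same id.
pairs <- merge(ivs, ivs, by = "id", suffixes = c("_x", "_y"))

# Keep each unordered pair once and drop an interval paired with itself.
pairs <- pairs[pairs$row_x < pairs$row_y, ]

# Two closed intervals overlap when each one starts before the other ends.
pairs$overlaps <- pairs$start_x <= pairs$end_y & pairs$start_y <= pairs$end_x

pairs[, c("id", "row_x", "row_y", "overlaps")]
```

Because only Date columns pass through the join, this does not depend on how the join handles S4 columns.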

I've reported the issue here

moodymudskipper
  • There's this [meta issue](https://github.com/tidyverse/dplyr/issues/2432) for better support for non-base types in `dplyr`, and [this issue](https://github.com/hadley/vctrs/issues/27) in vctrs. – pasipasi Feb 12 '18 at 06:51
  • Thanks. Relevant here: it seems one might have issues with `dplyr::filter` as well when dealing with lubridate intervals, again because of the `start` slot. – moodymudskipper Feb 12 '18 at 07:23

Looks like a bug in tibble():

> AA <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3))
> class(AA$b)
[1] "Interval"
attr(,"package")
[1] "lubridate"
> AA
Error in round_x - lhs :
  Arithmetic operators undefined for 'Interval' and 'Interval' classes:
  convert one to numeric or a matching time-span class.

However:

> AA <- as.data.frame(AA)
> class(AA$b)
[1] "Interval"
attr(,"package")
[1] "lubridate"
> AA
  a                              b
1 a 2001-01-01 UTC--2002-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC

Therefore, this works:

> AA <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3))
> BB <- tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003)))
> AA %>% as.data.frame %>% left_join(BB)
Joining, by = "a"
  a                              b                              c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC

although this does not:

> AA %>% left_join(BB)
Joining, by = "a"
Error in round_x - lhs :
  Arithmetic operators undefined for 'Interval' and 'Interval' classes:
  convert one to numeric or a matching time-span class.

Note: I'm using tibble_1.4.1 (same version of lubridate and dplyr as you), on R 3.4.3 for x86_64-pc-linux-gnu

renato vitolo
  • Interesting. Thanks. With the same package versions as OP, though on R version 3.3.3, I get identical output to OP, i.e. no error message. Thanks for your work. – Eric Fail Feb 11 '18 at 18:39

This problem does not exist anymore, as this issue has been closed and a related feature implemented. If you now run the code with updated packages, it will work.

library(lubridate)
library(tidyverse)

tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>% 
  left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
#> Joining, by = "a"
#> # A tibble: 9 x 3
#>   a     b                              c                             
#>   <chr> <Interval>                     <Interval>                    
#> 1 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 2 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 3 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 4 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 5 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 6 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 7 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 8 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 9 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC

Created on 2019-06-07 by the reprex package (v0.3.0)

pasipasi