2

I am trying to calculate the number of overlapping days between two time periods. One period is fixed in a start and end date, the other is recorded as start and end dates in a data frame.

Edit: I'm dealing with ads that have publish date (df$start) and an unpublish date (df$end). What I'm trying to find out is how many days they have been online in a specific month (my.start = 2018-01-01, my.end = 2018-08-31).

library(dplyr)
library(lubridate)

my.start <- ymd("2018-08-01")
my.end <- ymd("2018-08-31")

df <- data.frame(start = c("2018-07-15", "2018-07-20", "2018-08-15", "2018-08-20", "2018-09-01"), 
                 end   = c("2018-07-20", "2018-08-05", "2018-08-19", "2018-09-15", "2018-09-15"))

# strings to dates
df <- mutate(df, start = ymd(start), end = ymd(end))

# does not work - calculate overlap in days
df <- mutate(df, overlap = intersect(interval(my.start, my.end), interval(start, end)))

Results should be 0, 5, 4, 12, 0 days:

   my.start |-------------------------------| my.end

|-----| (0)
        |---------| (5)
                            |----| (4)
                                   |------------------| (12)
                                             |---------------| (0)

In Excel, I would use

=MAX(MIN(my.end, end) - MAX(my.start, start) + 1, 0)

but that doesn't work either:

# does not work - calculate via min/max
df <- mutate(df, overlap = max(min(my.end, end) - max(my.start, start) + 1, 0))

Before I try to use the Excel approach using as.numeric() on the dates, I wondered if there is a cleverer way to do this.

EDIT: Actually, the Excel numeric approach doesn't seem two work either (all results are zero):

# does not work - calculate via numeric

ms.num <- as.numeric(my.start)
me.num <- as.numeric(my.end)

df <- df %>% 
  mutate(s.num = as.numeric(start),
         e.num = as.numeric(end),

         overlap = max(min(e.num, me.num) - max(s.num, ms.num) + 1, 0))

Edit: The approach by @akrun seems to work for ymd dates. However, it doesn't seem to work for ymd_hms times:

library(dplyr)
library(lubridate)
library(purrr)

my.start <- ymd("2018-08-01")
my.end <- ymd("2018-08-31")

df <- data.frame(start = c("2018-07-15 10:00:00", "2018-07-20 10:00:00", "2018-08-15 10:00:00", "2018-08-20 10:00:00", "2018-09-01 10:00:00"), 
                 end   = c("2018-07-20 10:00:00", "2018-08-05 10:00:00", "2018-08-19 10:00:00", "2018-09-15 10:00:00", "2018-09-15 10:00:00"))

# strings to dates
df <- mutate(df, start = ymd_hms(start), end = ymd_hms(end))

# leads to 0 results
df %>% mutate(overlap = map2(start, end, ~ sum(seq(.x, .y, by = '1 day') %in% seq(my.start, my.end, by = '1 day'))))
Timm S.
  • 5,135
  • 6
  • 24
  • 38
  • I'm not sure I understand what you're trying to do. Where do `my.start` and `my.end` come from? Can you please edit your post to include your expected output for the sample data you give (don't add critical information in comments, as comments are transient). – Maurits Evers Sep 07 '18 at 14:45
  • Edit: added some context and expected results – Timm S. Sep 07 '18 at 14:55
  • 2
    Try `df %>% mutate(overlap = map2(start, end, ~ sum(seq(.x, .y, by = '1 day') %in% seq(my.start, my.end, by = '1 day'))))` I think it is `0 5 5 12 0` – akrun Sep 07 '18 at 15:12
  • @akrun: This works for the example, thank you! However, in my actual data, I am dealing with ymd_hms data, which leads to 0 as a result. I'm posting an example above. – Timm S. Sep 07 '18 at 15:29
  • @TimmS. If you have `ymd_hms`, can convert to `Date` class with `as.Date` and then it should work i.e. your new data won't work with the second solution posted below. for that `df %>% mutate(overlap = map2_dbl(start, end, ~ max(as.integer(min(my.end, as.Date(.y)) - max(my.start, as.Date(.x)) + 1), 0)))` – akrun Sep 07 '18 at 15:36

2 Answers2

5

I think you may be running into issues with max and min vs pmax and pmin:

library(dplyr)

df %>%
  mutate(overlap = pmax(pmin(my.end, end) - pmax(my.start, start) + 1,0))

       start        end overlap
1 2018-07-15 2018-07-20  0 days
2 2018-07-20 2018-08-05  5 days
3 2018-08-15 2018-08-19  5 days
4 2018-08-20 2018-09-15 12 days
5 2018-09-01 2018-09-15  0 days
zack
  • 5,205
  • 1
  • 19
  • 25
3

We can use pmin/pmax to get the min/max of two sets of vectors

df %>% 
   mutate(overlap = ifelse(my.start > end, 0, pmin(my.end, end) - 
                                 pmax(my.start, start) + 1))
#       start        end overlap
#1 2018-07-15 2018-07-20   0
#2 2018-07-20 2018-08-05   5
#3 2018-08-15 2018-08-19   5
#4 2018-08-20 2018-09-15  12
#5 2018-09-01 2018-09-15   0

If we want to use the same option as in the OP's code, i.e. min/max, either with rowwise() or using map2, we loop through rows

library(purrr)
df %>% 
  mutate(overlap = map2_dbl(start, end, ~
        max( as.integer(min(my.end, .y) - max(my.start, .x) + 1), 0)))

Noticed that the OP's actual data have time component. In that case, change the above solution by converting to Date class

df %>% 
   mutate(overlap = map2_dbl(start, end, ~
     max(as.integer(min(my.end, as.Date(.y)) - max(my.start, as.Date(.x)) + 1), 0)))
akrun
  • 874,273
  • 37
  • 540
  • 662