0

I was wondering is there were a simple way to filter a dataset to only keep records for nearest intervals.

My data looks like this:

library(tidyverse)
library(lubridate)    
df <- data.frame(country = rep(c("Spain","Portugal"), each = 4), 
                     type = rep(c("1","2"), each = 4), 
                     name = rep(c("A","B"), each = 4), 
                     event_start = as.Date(c("2012-07-13", "2014-09-05", "2016-12-23", "2017-01-01", "2015-11-27", "2014-06-27", "2013-04-11", "2012-11-27")), 
                     event_end = as.Date(c("2014-09-04", "2016-12-22", "2016-12-31", "2017-01-09", "2016-02-10", "2014-11-26", "2014-06-26", "2013-04-10")), 
                     start = rep(as.Date(c("2008-10-01", "2017-01-01")), each = 4),
                     end = rep(as.Date(c("2008-12-31", "2017-12-31")), each = 4),
                     stringsAsFactors = FALSE) %>%
      mutate(event_interval = interval(event_start, event_end),
             int = interval(start, end))

Expected result:

country type name, event_start, event_end, start, end, event_interval, int
Spain 1 A 2012-07-13 2014-09-04 2008-10-01 2008-12-31 2012-07-13 UTC--2014-09-04 UTC 2008-10-01 UTC--2008-12-31 UTC
Portugal 2 B 2015-11-27 2016-02-10 2017-01-01 2017-12-31 2015-11-27 UTC--2016-02-10 UTC 2017-01-01 UTC--2017-12-31 UTC

In essence, I would like to keep — for each combination of country/type/name — the row for which event_interval is the nearest to int.

I've tried (and kind of succeeded) with a not-so-good-looking for loop but was wondering if you knew of something more straightforward with dplyr?

Cheers

EDIT To clarify, in the example above, event_int and int don't intersect, but this is not always true in my entire, proper dataset. In fact, for many combinations of country/type/name, there may be several event_int overlapping with int, so I really need to find which event_int is the most similar to int, i.e. the event_int that overlaps the most with int OR is the nearest to int.

Fred-LM
  • 300
  • 1
  • 11
  • 1
    How do you define "nearest"? – Mako212 Sep 06 '19 at 16:53
  • The `event_int` that overlaps the most with `int` OR is the nearest to `int` – Fred-LM Sep 06 '19 at 16:54
  • 1
    If your interval is `2019-01-01 to 2019-01-02` and you are going to choose the nearest interval from `2018-01-01 to 2018-12-31` and `2019-01-03 to 2019-12-31`, which one is closer? – Mako212 Sep 06 '19 at 16:56
  • What package is the interval() function from? – Bill O'Brien Sep 06 '19 at 16:56
  • Sorry @Bill O'Brien, I forgot to include the packages. I edited the post – Fred-LM Sep 06 '19 at 16:58
  • @Mako212, the overlapping one is, for my purpose, more important. If it would be easier to separate instances where both types of intervals overlap, and instances where they don't, that's OK – Fred-LM Sep 06 '19 at 17:07

1 Answers1

0

Based on the comments above, I figured it out. Asking my question really helped me clarify what I wanted to do, and I then found R / lubridate: Calculate number of overlapping days between two periods.

The following bit of code does what I need:

df <- df %>%
mutate(ndays = pmax(pmin(end, event_end) - pmax(start, event_start) + 1, pmax(pmin(end, event_end) - pmax(start, event_start) + 1))) %>%
  group_by(country, type, name) %>%
  arrange(country, type, name, desc(ndays)) %>%
  filter(row_number() == 1) #Keeps nearest or most overlapping record
Fred-LM
  • 300
  • 1
  • 11