0

Similarly to a question posted here, I want to compute number of overlapping days between two periods conditional on a third variable (location).

For each observation of the main dataset (DF) I have a starting and an end date, and a location (character) variable. The Events data comprises information on event location, starting date and end date. Multiple events in the same location and (partially) overlapping periods are allowed.

Thus for each observation in DF the period must be compared to other periods in an event dataset (Events). This means that the count of overlapping days between the between one (DF) and multiple periods (Events) must be done net of overlapping days between two (or more) periods in the Events dataset

An example of the data structure of my two data sources can be easily reproduced in R using this code (note that the location variable has been set to an integer for simplicity):

        set.seed(1)
    DF <- data.frame(
      start = sample(seq(as.Date('2018-01-01'), as.Date('2018-04-30'), by="day"), 20),
      end = sample(seq(as.Date('2018-05-01'), as.Date('2018-10-30'), by="day"), 20),
      location = sample(seq(1:5)),20)


    Events <- data.frame(
      start = sample(seq(as.Date('2018-01-01'), as.Date('2018-04-30'), by="day"), 30),
      end = sample(seq(as.Date('2018-05-01'), as.Date('2018-10-30'), by="day"), 30),
      location = sample(seq(1:5)),  30 )

In the simple case in which the Events data reduces to only one event (and we do not care about the location) counting overallping days for each obervation in DF can be done easily with the following code and dplyr: code taken from Matthew Lundberg answer here, also note that I have created another dataframe with a single event (One_event):

library(dplyr)

One_event <- data.frame(
  start = as.Date('2018-01-01'),
  end = as.Date('2018-07-30'))

DF %>%
  mutate(overlap = pmax(pmin(One_event$end, end) - pmax(One_event$start, start) + 1,0))

resulting in:

   start        end       location X20  overlap
1  2018-02-01 2018-10-19        5  20 180 days
2  2018-02-14 2018-06-08        3  20 115 days
3  2018-03-09 2018-08-26        4  20 144 days
4  2018-04-17 2018-05-23        2  20  37 days
5  2018-01-24 2018-06-17        1  20 145 days
6  2018-04-14 2018-07-08        5  20  86 days
7  2018-04-18 2018-05-03        3  20  16 days
8  2018-03-16 2018-07-07        4  20 114 days
9  2018-03-12 2018-09-30        2  20 141 days
10 2018-01-07 2018-06-29        1  20 174 days
11 2018-01-23 2018-07-23        5  20 182 days
12 2018-01-20 2018-08-12        3  20 192 days
13 2018-04-23 2018-07-24        4  20  93 days
14 2018-02-11 2018-06-01        2  20 111 days
15 2018-03-23 2018-09-17        1  20 130 days
16 2018-02-22 2018-08-21        5  20 159 days
17 2018-04-24 2018-09-10        3  20  98 days
18 2018-04-13 2018-05-18        4  20  36 days
19 2018-02-08 2018-08-28        2  20 173 days
20 2018-03-20 2018-10-23        1  20 133 days

Now back to the orginal problem. To allow comparison between the period of each observation in Data and the matching event(s) according to observation's and event's location I think that would be reasonable to use the apply function, subset the Event dataset according to the observation location, and finally run the mutate function for each row and a subset of the Events data (temp):

    apply(DF, 1, function(x) {
  temp =  Events[Events$location %in% x["location"]
 x %>%
      mutate(overlap = pmax(pmin(temp$end, end) - pmax(temp$start, start) + 
1,0))
})

There are several issues with this last part of code. First, does not work and gives an error message:

(Error in UseMethod("mutate_") : 
  no applicable method for 'mutate_' applied to an object of class "character")

Second, it does not account for two (or more periods) overlapping in the Events dataset.

Caserio
  • 472
  • 1
  • 3
  • 14

1 Answers1

1

are you looking for this:

apply(DF, MARGIN = 1, function(x) {
  Events[Events$location == x["location"],] %>% mutate(overlap = pmax(pmin(.data$end,         
  x["end"]) - pmax(.data$start, x["start"])))
})

This results in my case to:

[[1]]
   start        end location X30  overlap
1 2018-02-01 2018-07-28        5  30 177 days
2 2018-04-14 2018-08-27        5  30 135 days
3 2018-01-23 2018-09-20        5  30 231 days
4 2018-02-22 2018-09-10        5  30 200 days
5 2018-04-04 2018-07-17        5  30 104 days
6 2018-02-06 2018-05-16        5  30  99 days

[[2]]
   start        end location X30  overlap
1 2018-01-24 2018-09-26        3  30 114 days
2 2018-01-07 2018-07-11        3  30 114 days
3 2018-03-23 2018-10-28        3  30  77 days
4 2018-03-20 2018-08-22        3  30  80 days
5 2018-01-26 2018-05-12        3  30  87 days
6 2018-01-31 2018-07-02        3  30 114 days

[[3]]
   start        end location X30  overlap
1 2018-03-09 2018-07-29        4  30 142 days
2 2018-03-16 2018-05-19        4  30  64 days
3 2018-04-23 2018-09-11        4  30 125 days
4 2018-04-13 2018-07-19        4  30  97 days
5 2018-03-05 2018-07-10        4  30 123 days
6 2018-02-05 2018-07-20        4  30 133 days

...
floe
  • 417
  • 4
  • 12
  • This is certantly helping much. The code you posted produces the count of overlapping days between the period in DF (for each obs) and the ones in the Events dataset conditionally to the location variable. However, I would like to get a single count for each obs. in DF net of the eventually overalapping periods of the events. – Caserio Dec 06 '18 at 14:11
  • @Caserio can you give me an example? – floe Dec 06 '18 at 14:15
  • Sure. First of all I clarify that what I would like as output is the time of exposure to certain events. If two events occur simultaneously I do not want to double count the exposure. Let's consider the very first part of the output of your code: `[[1]] start end location X30 overlap 1 2018-02-01 2018-07-28 5 30 177 days 2 2018-04-14 2018-08-27 5 30 135 days` here the two events overlap from Apri 14th to July 28th. Thus, if only these two events were to be considered, the true exposure would be from Feb 1st to Aug 27h. – Caserio Dec 06 '18 at 14:28
  • @Caserio I understand, not so easy. Maybe a solution is as a first step to aggregate your EventDF to minimum overlapping events for each location. – floe Dec 06 '18 at 15:47