Similarly to a question posted here, I want to compute number of overlapping days between two periods conditional on a third variable (location).
For each observation of the main dataset (DF) I have a starting and an end date, and a location (character) variable. The Events data comprises information on event location, starting date and end date. Multiple events in the same location and (partially) overlapping periods are allowed.
Thus for each observation in DF the period must be compared to other periods in an event dataset (Events). This means that the count of overlapping days between the between one (DF) and multiple periods (Events) must be done net of overlapping days between two (or more) periods in the Events dataset
An example of the data structure of my two data sources can be easily reproduced in R using this code (note that the location variable has been set to an integer for simplicity):
set.seed(1)
DF <- data.frame(
start = sample(seq(as.Date('2018-01-01'), as.Date('2018-04-30'), by="day"), 20),
end = sample(seq(as.Date('2018-05-01'), as.Date('2018-10-30'), by="day"), 20),
location = sample(seq(1:5)),20)
Events <- data.frame(
start = sample(seq(as.Date('2018-01-01'), as.Date('2018-04-30'), by="day"), 30),
end = sample(seq(as.Date('2018-05-01'), as.Date('2018-10-30'), by="day"), 30),
location = sample(seq(1:5)), 30 )
In the simple case in which the Events data reduces to only one event (and we do not care about the location) counting overallping days for each obervation in DF can be done easily with the following code and dplyr
: code taken from Matthew Lundberg answer here, also note that I have created another dataframe with a single event (One_event):
library(dplyr)
One_event <- data.frame(
start = as.Date('2018-01-01'),
end = as.Date('2018-07-30'))
DF %>%
mutate(overlap = pmax(pmin(One_event$end, end) - pmax(One_event$start, start) + 1,0))
resulting in:
start end location X20 overlap
1 2018-02-01 2018-10-19 5 20 180 days
2 2018-02-14 2018-06-08 3 20 115 days
3 2018-03-09 2018-08-26 4 20 144 days
4 2018-04-17 2018-05-23 2 20 37 days
5 2018-01-24 2018-06-17 1 20 145 days
6 2018-04-14 2018-07-08 5 20 86 days
7 2018-04-18 2018-05-03 3 20 16 days
8 2018-03-16 2018-07-07 4 20 114 days
9 2018-03-12 2018-09-30 2 20 141 days
10 2018-01-07 2018-06-29 1 20 174 days
11 2018-01-23 2018-07-23 5 20 182 days
12 2018-01-20 2018-08-12 3 20 192 days
13 2018-04-23 2018-07-24 4 20 93 days
14 2018-02-11 2018-06-01 2 20 111 days
15 2018-03-23 2018-09-17 1 20 130 days
16 2018-02-22 2018-08-21 5 20 159 days
17 2018-04-24 2018-09-10 3 20 98 days
18 2018-04-13 2018-05-18 4 20 36 days
19 2018-02-08 2018-08-28 2 20 173 days
20 2018-03-20 2018-10-23 1 20 133 days
Now back to the orginal problem.
To allow comparison between the period of each observation in Data and the matching event(s) according to observation's and event's location I think that would be reasonable to use the apply
function, subset the Event dataset according to the observation location, and finally run the mutate function for each row and a subset of the Events data (temp):
apply(DF, 1, function(x) {
temp = Events[Events$location %in% x["location"]
x %>%
mutate(overlap = pmax(pmin(temp$end, end) - pmax(temp$start, start) +
1,0))
})
There are several issues with this last part of code. First, does not work and gives an error message:
(Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "character")
Second, it does not account for two (or more periods) overlapping in the Events dataset.