I have data like this:
Sample 1 04/01/18 01/02/18
Sample 2 01/01/18 05/01/18
Sample 4 01/01/16 01/02/16
My desired output is contains only Sample 1 and Sample 2 since these overalap in time. I can't think of a way to do this?
I have data like this:
Sample 1 04/01/18 01/02/18
Sample 2 01/01/18 05/01/18
Sample 4 01/01/16 01/02/16
My desired output is contains only Sample 1 and Sample 2 since these overalap in time. I can't think of a way to do this?
Here is 1 approach. Merge the data to itself and only keep the rows where the dates overlap and IDs don't match
First create the data
sample = c('Sample 1', 'Sample 2', 'Sample 4')
start_date = c('04/01/18', '01/01/18', '01/01/16')
end_date = c('01/02/18', '05/01/18', '01/02/16')
df = data.frame(sample, start_date, end_date)
df
sample start_date end_date
Sample 1 04/01/18 01/02/18
Sample 2 01/01/18 05/01/18
Sample 4 01/01/16 01/02/16
Next, transform the strings to dates, so we will be able to do date comparison
df = transform(df,
end_date = as.Date(df$end_date, "%m/%d/%y"),
start_date = as.Date(df$start_date, "%m/%d/%y")
)
Last step is to merge the data to a copy of itself. To do this, I add a field called 'dummy' to merge on, which creates a cartesian product of the data. To filter out the mismatches, I only keep where sample IDs are the different and dates overlap. Here is a good discussion of the logic. Lastly, I did some cleanup by dropping extra columns and resetting names to the original values
df %>%
mutate(dummy=TRUE) %>%
left_join(df %>% mutate(dummy=TRUE), by='dummy') %>%
filter(start_date.x < end_date.y, end_date.x > start_date.y, sample.x!=sample.y) %>%
select(grep('.x$', names(.))) %>%
setNames(gsub('.x$','',names(.)))
sample start_date end_date
Sample 1 2018-04-01 2018-01-02
Sample 2 2018-01-01 2018-05-01
If you are doing this with data where there are multiple overlaps per record, you would also need to deduplicate the rows