0

I have data like this:

Sample 1    04/01/18   01/02/18
Sample 2    01/01/18   05/01/18
Sample 4    01/01/16   01/02/16

My desired output is contains only Sample 1 and Sample 2 since these overalap in time. I can't think of a way to do this?

Sam Lipworth
  • 107
  • 4
  • 12

1 Answers1

0

Here is 1 approach. Merge the data to itself and only keep the rows where the dates overlap and IDs don't match

First create the data

sample = c('Sample 1', 'Sample 2', 'Sample 4')
start_date = c('04/01/18', '01/01/18', '01/01/16')
end_date = c('01/02/18', '05/01/18', '01/02/16')
df = data.frame(sample, start_date, end_date)
df

sample     start_date  end_date
Sample 1   04/01/18    01/02/18
Sample 2   01/01/18    05/01/18
Sample 4   01/01/16    01/02/16

Next, transform the strings to dates, so we will be able to do date comparison

df = transform(df, 
    end_date = as.Date(df$end_date, "%m/%d/%y"),
    start_date = as.Date(df$start_date, "%m/%d/%y")
)

Last step is to merge the data to a copy of itself. To do this, I add a field called 'dummy' to merge on, which creates a cartesian product of the data. To filter out the mismatches, I only keep where sample IDs are the different and dates overlap. Here is a good discussion of the logic. Lastly, I did some cleanup by dropping extra columns and resetting names to the original values

df %>% 
    mutate(dummy=TRUE) %>% 
    left_join(df %>% mutate(dummy=TRUE), by='dummy') %>%
    filter(start_date.x < end_date.y, end_date.x > start_date.y, sample.x!=sample.y) %>%
    select(grep('.x$', names(.)))  %>% 
    setNames(gsub('.x$','',names(.)))

sample     start_date   end_date
Sample 1   2018-04-01   2018-01-02
Sample 2   2018-01-01   2018-05-01

If you are doing this with data where there are multiple overlaps per record, you would also need to deduplicate the rows

danielson
  • 1,029
  • 7
  • 8