Subsetting rows based on dates and criteria across two data frames

Question

I have one data frame outlining pollution levels continuously measured from two sites.

Dates <- as.data.frame(seq(as.Date("2015/01/01"), as.Date("2017/01/01"),"day"))
Pollution_Site.A <- as.data.frame(c(seq(from = 1, to = 366, by = 1),
                       (seq(from = 366, to = 1, by = -1))))
Pollution_Site.B <- as.data.frame(c(seq(from = 0, to = 365, by = 1),
                                (seq(from = 365, to = 0, by = -1))))
df1 <- cbind(Dates,Pollution_Site.A,Pollution_Site.B)
colnames(df1) <- c("Dates","Site.A","Site.B")

I have a separate data frame highlighting when surveyors (each site has one unique surveyor) visited each site.

Site<- c("Site.A","Site.A","Site.B","Site.B")
Survey_Dates <- as.data.frame(as.POSIXct(c("2014/08/17","2016/08/01",
                                      "2015/02/01","2016/10/31")))
df2 <- as.data.frame(cbind(Site,Survey_Dates))
colnames(df2) <- c("Site","Survey_Dates")

What I want to do is (i) define a high pollution event (although perhaps some form of 'apply' function would be better to do this iteratively across multiple sites)?

 High_limit_Site.A <- 1.5*median(df1$Site.A)
 High_limit_Site.B <- 1.5*median(df1$Site.B)

The I want to (ii) subset the second data frame to show which surveyors have visited the site before and after a high pollution event within 1 year (providing there is pollution data as well). I presume something along the 'difftime' function will work here, but am not sure how I would apply this.

Finally, I would like (iii) the subsetted data frame to highlight whether the surveyor was out before or after the pollution event.

So in the example above, the desired output should only contain Site B. This is because Site A's first survey date precedes the first pollution measurement AND was over a year before the high pollution event. Thank you in advance for any help on this.

What did you try? Where are you stuck? I'd suggesting merging your two data frames and then taking some subsets. — Gregor Thomas, Dec 19 '17 at 19:08
Thank you for the comment. I have broken the question down a little bit. I have used difftime functions before to define time series, but am not sure how I will be able to get this to work here at all. — James White, Dec 19 '17 at 19:36

score 3 · Answer 1 · answered Dec 19 '17 at 19:37

You need to pivot df1 and then cross-join it with df2

library(dplyr)
library(tidyr)

df1 %>% gather(key=Site, value=Pollution, -Dates) %>% 
  group_by(Site) %>% 
  mutate(HighLimit=as.numeric(Pollution>1.5*median(Pollution))) %>% 
  filter(HighLimit==1) %>% 
  # this will function as cross-join because Site is not a unique ID
  left_join(df2, by=c("Site")) %>% 
  mutate(Time_Lag = as.numeric(as.Date(Survey_Dates)-as.Date(Dates)),
         Been_Before = ifelse(Time_Lag>0, "after", "before")) %>% 
  filter(abs(Time_Lag)<365) %>% 
  group_by(Site, Survey_Dates, Been_Before) %>% 
  summarise(Event_date_min=min(Dates), 
            Event_date_max=max(Dates))

Here you can see earliest and latest event corresponding to each visit

# A tibble: 3 x 5
# Groups:   Site, Survey_Dates [?]
    Site Survey_Dates Been_Before Event_date_min Event_date_max
   <chr>       <dttm>       <chr>         <date>         <date>
1 Site.A   2016-08-01       after     2015-10-03     2016-04-01
2 Site.B   2015-02-01      before     2015-10-02     2016-01-30
3 Site.B   2016-10-31       after     2015-11-01     2016-04-02

score 0 · Accepted Answer · answered Dec 20 '17 at 09:44

Just to build on the answer @dmi3kno displayed above, I can then subset sites which contain both a "before" and "after" sign for each site.

Output_df <- df1 %>% gather(key=Site, value=Pollution, -Dates) %>% 
group_by(Site) %>% 
mutate(HighLimit=as.numeric(Pollution>1.5*median(Pollution))) %>% 
filter(HighLimit==1) %>% 
left_join(df2, by=c("Site")) %>% 
mutate(Time_Lag = as.numeric(as.Date(Survey_Dates)-as.Date(Dates)),
     Been_Before = ifelse(Time_Lag>0, "after", "before")) %>% 
filter(abs(Time_Lag)<365) %>% 
group_by(Site, Survey_Dates, Been_Before) %>% 
summarise(Event_date_min=min(Dates), 
        Event_date_max=max(Dates))

Then using dplyr again:

Final_df <- Output_df %>%
group_by(Site) %>%
filter(all(c("before", "after") %in% Been_Before))

Subsetting rows based on dates and criteria across two data frames

2 Answers2