Create breaks in one data.frame by time intervals in another: fuzzy join

Question

I record CO2 in df2 and have a list of experiment start and end times in d:

data.frame df2 that contains continuous CO2 measurements over time.

df2<-data.frame(CO2.ppm.=sample(300:500,72,replace=TRUE),Dev.Date.Time=seq(
     from=as.POSIXct("2012-1-1 0:00", tz="BST"),
     to=as.POSIXct("2012-1-3 23:00", tz="BST"),
     by="hour"
   ) )

I have a data.frame df1 with a continuous time variable called: Dev.Date.Time, a column called ExperimentID and the type of ExperimentType that was recorded. Note, there's a chunk of time where no experiment was taking place but don't need to remove it.

df1<-data.frame(ExperimentID=rep(1:12,each=6),ExperimentType=rep(c("IV","NoExperiment","Obs"),each=24),Dev.Date.Time=seq(
     from=as.POSIXct("2012-1-1 0:00", tz="BST"),
     to=as.POSIXct("2012-1-3 23:00", tz="BST"),
     by="hour"
   ) )

I have then created another data.frame d with start and end times of each experiments.

startTime<-aggregate(data=df1,Dev.Date.Time~ExperimentID+ExperimentType,head,1)
endTime<-aggregate(data=df1,Dev.Date.Time~ExperimentID+ExperimentType,tail,1)

d<-inner_join(startTime, endTime, by=c("ExperimentID","ExperimentType"))

I'd like to create a column in df2 called ExperimentID and another one called ExperimentType based on the start and stop times that I found in d

I'm trying the following that makes the breaks but I can't work out how to make the labels match. Any thoughts are much appreciated.

Originally I thought about using cut. While it made the breaks I wanted I wasn't any closer to labelling them by ExperimentID.

breakz <- as_tibble(lubridate::ymd_hms(d$Dev.Date.Time.x,d$Dev.Date.Time.y))
    breakz<-dplyr::arrange(breakz,value)
    df1$ActivityID<-cut(df1$Dev.Date.Time,breaks=unique(breakz$value), labels = c(d$ExperimentID,d$ExperimentType)

EDIT:

Based on suggestions in the comments I'm trying fuzzyjoin because in reality the time-stamps don't match exactly. So I need merge by an interval.

require(fuzzy join)

df3<-(fuzzy_right_join(
  d, df2,
  by = c(
    "StartTime" = "Dev.Date.Time",
    "EndTime" = "Dev.Date.Time"
  ),
  match_fun = list( `>=`, `<=`)
))

Produces NA in all df3$ExperimentID. Any thoughts?

Thank you for reading though this! Ohhh so don't need to do cut at all? My computer is still running it as my dfs are 1.5million obs. — HCAI, Sep 09 '19 at 14:12
At least from your sample, the two time series from `df1` and `df2` are identical, so no. — Quang Hoang, Sep 09 '19 at 14:14
I'll have to test this out and get back to you. Just thinking, what do you mean by interpolate in this case? — HCAI, Sep 09 '19 at 14:21
after you outer join, you would have a `na` at those timestamps with no information, you can fill these `na` with those around it. For example `IV IV na na IV` can be turned into `IV IV IV IV IV`. — Quang Hoang, Sep 09 '19 at 14:23
Sorry, there's a subtlety I'm not quite grasping. d contains start and end-times, so how does an outer merge handle this? — HCAI, Sep 09 '19 at 20:30
I think a fuzzyjoin might be the answer. https://community.rstudio.com/t/tidy-way-to-range-join-tables-on-an-interval-of-dates/7881 I’ll try this in the morning — HCAI, Sep 10 '19 at 05:23

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

This WORKS and is FAST! Complains about timezones but seems to be OK

FROM:

https://www.r-bloggers.com/in-between-a-rock-and-a-conditional-join/

library(data.table)
  # Attempt #4: Use the data.table package
  myDataDT <- data.table(df2)
  myDataDT[, SomeValueHelp := Dev.Date.Time]
  linkTableDT <- data.table(d)
  setkey(linkTableDT, StartTime, EndTime)
  
  df3 <- foverlaps(myDataDT, linkTableDT, by.x=c('Dev.Date.Time', 'SomeValueHelp'), 
                      by.y=c('StartTime', 'EndTime'))
  
  df3[sample(nrow(df3), 3),]

Create breaks in one data.frame by time intervals in another: fuzzy join

1 Answers1

https://www.r-bloggers.com/in-between-a-rock-and-a-conditional-join/