Similar to do.call/lapply approach here, and data.table approach here, but both have the setup of:
- MainDF with data and startdate/enddate ranges
- SubDF with a vector of single dates
Where the users are looking for summaries of all the MainDF ranges that overlap each SubDF date. I have
- MainDF with data and a vector of single dates
- SubDF with startdate/enddate ranges
And am looking to append summaries, to SubDF, for multiple rows of MainDF data which fall within each SubDF range. Example:
library(lubridate)
MainDF <- data.frame(Dates = seq.Date(from = as.Date("2020-02-12"),
by = "days",
length.out = 10),
DataA = 1:10)
SubDF <- data.frame(DateFrom = as.Date(c("2020-02-13", "2020-02-16", "2020-02-19")),
DateTo = as.Date(c("2020-02-14", "2020-02-17", "2020-02-21")))
SubDF$interval <- interval(SubDF$DateFrom, SubDF$DateTo)
Trying the data.table approach from the second link I figure it should be something like:
MainDF[SubDF, on = .(Dates >= DateFrom, Dates <= DateTo), allow = TRUE][
, .(SummaryStat = max(DataA)), by = .(Dates)]
But it errors with unused arguments for on
. On my actual data I got a result by using (the equivalent of) max(MainDF$DataA)
, but it was 3 repeats of the second value (In my actual data the final row won't run as it doesn't have a value for DateTo). I suspect using MainDF$
means I've subverting the grouping.
I suspect I'm close but I'm really struggling to get my head around the data.table
mindset for complex use cases. The summary stats I'm looking to do are (for example data):
Mean
&Max
of DataAlength(which(DataA > 3))
difftime(last(Dates), first(Dates), units = "mins")
Dates[which.max(DataA)]
I added the interval
line above as data.table
's %between%
help suggests one might be able to use a Dates %between% interval
format but it doesn't mention intervals/difftimes specifically in the text nor examples and my attempts are already failing elsewhere so I'm loathe to concentrate on improving my running while I can't walk!
I've focused on the data.table approach since it's used for a similar problem, but I've been wondering whether dplyr
's group_by
/group_by_if
could be used instead? group_by_if
's .predicate
seems to be constrained to tests on the columns (e.g. are they factors) rather than relating to data in the columns' rows, but I could be wrong.
Thanks in advance for any help!