0

Similar to do.call/lapply approach here, and data.table approach here, but both have the setup of:

  • MainDF with data and startdate/enddate ranges
  • SubDF with a vector of single dates

Where the users are looking for summaries of all the MainDF ranges that overlap each SubDF date. I have

  • MainDF with data and a vector of single dates
  • SubDF with startdate/enddate ranges

And am looking to append summaries, to SubDF, for multiple rows of MainDF data which fall within each SubDF range. Example:

library(lubridate)
MainDF <- data.frame(Dates = seq.Date(from = as.Date("2020-02-12"),
                                      by = "days",
                                      length.out = 10),
                     DataA = 1:10)
SubDF <- data.frame(DateFrom = as.Date(c("2020-02-13", "2020-02-16", "2020-02-19")),
                    DateTo = as.Date(c("2020-02-14", "2020-02-17", "2020-02-21")))
SubDF$interval <- interval(SubDF$DateFrom, SubDF$DateTo)

Trying the data.table approach from the second link I figure it should be something like:

MainDF[SubDF, on = .(Dates >= DateFrom, Dates <= DateTo), allow = TRUE][
, .(SummaryStat = max(DataA)), by = .(Dates)]

But it errors with unused arguments for on. On my actual data I got a result by using (the equivalent of) max(MainDF$DataA), but it was 3 repeats of the second value (In my actual data the final row won't run as it doesn't have a value for DateTo). I suspect using MainDF$ means I've subverting the grouping.

I suspect I'm close but I'm really struggling to get my head around the data.table mindset for complex use cases. The summary stats I'm looking to do are (for example data):

  • Mean & Max of DataA
  • length(which(DataA > 3))
  • difftime(last(Dates), first(Dates), units = "mins")
  • Dates[which.max(DataA)]

I added the interval line above as data.table's %between% help suggests one might be able to use a Dates %between% interval format but it doesn't mention intervals/difftimes specifically in the text nor examples and my attempts are already failing elsewhere so I'm loathe to concentrate on improving my running while I can't walk!

I've focused on the data.table approach since it's used for a similar problem, but I've been wondering whether dplyr's group_by/group_by_if could be used instead? group_by_if's .predicate seems to be constrained to tests on the columns (e.g. are they factors) rather than relating to data in the columns' rows, but I could be wrong.

Thanks in advance for any help!

dez93_2000
  • 1,730
  • 2
  • 23
  • 34
  • 1
    for *But it errors with unused arguments for on*, you need to convert your dataset to `data.table` first using `setDT(MainDF)` and `setDT(SubDF)` – chinsoon12 Feb 13 '20 at 00:38
  • 1
    As @chinsoon12 pointed out, your code works when you `setDT` both of your data.frames. – PavoDive Feb 13 '20 at 02:43
  • oh goodness... thanks both. I had `as.data.table` for one of them but evidently not both, and maybe not the right syntax. I'll give it a try tomorrow when my megaloop has finished! Cheers :) – dez93_2000 Feb 13 '20 at 03:51

0 Answers0