Is there a faster alternative to the date extension function using the zoo package?

Question

The code posted at the bottom does a clean job of extending dates expressed both (a) as elapsed number of months per Period_1 in the code and (b) in YYYY-MM format defined as Period_2. Item (b) uses the zoo package as.yearmon() function for extending the YYYY-MM date.

However, when I run this code against the 2m+ row data file this is intended for, running Period_1 (item (a) above) is very fast (only 3.3 seconds) but running the code for Period_2 (item (b) above) is very slow (taking 1.74 minutes total). Is there a fast alternative to the date extensions for Period_2, for example using data.table package or just dplyr?

Code:

start_time <- Sys.time()

library(dplyr)
library(tidyr)
library(zoo)

testDF <-
  data.frame(
    ID = as.numeric(c(rep(1,5),rep(50,3),rep(60,3))),
    Period_1 = as.numeric(c(1:5,1:3,1:3)),
    Period_2 = c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-10","2012-11","2012-12"),
    Bal = as.numeric(c(rep(10,5),21:23,36:34)),
    State = c("XX","AA","BB","CC","XX","AA","BB","CC","SS","XX","AA")
  )

testPeriod_1 <-
  testDF %>%
  tidyr::complete(ID, nesting(Period_1)) %>%
  tidyr::fill(Bal, State, .direction = "down")

testPeriod_2 <- testPeriod_1 %>%
  group_by(ID) %>%
  mutate(Period_2 = as.yearmon(first(Period_2)) + seq(0, by=1/12, length=n())) %>%
  mutate(Period_2 = format(Period_2, "%Y-%m")) %>%
  ungroup

end_time <- Sys.time()
end_time - start_time

Already tried all the alternatives given in your [earlier question](https://stackoverflow.com/q/74515720/6574038)? You can use `microbenchmark::microbenchmark()`. — jay.sf, Nov 22 '22 at 06:59
Hi jay.sf, I'm trying those alternatives now (adjusting for turn of year) and will report back — Village.Idyot, Nov 22 '22 at 07:31
My original answer that this comes from also discussed not formatting. That is an expensive operation that likely takes half the time. Just omit the `format` line. — G. Grothendieck, Nov 22 '22 at 09:59
Yes, G. Grothendieck that helps. Your solution posted at stackoverflow.com/questions/74515720/… is indeed fast. Eliminating format helped. — Village.Idyot, Nov 22 '22 at 10:37

score 1 · Accepted Answer · answered Nov 22 '22 at 07:23

Avoid the group-by:

library(hutilscpp)

testPeriod_1 %>%
  mutate(Period_2 = as.yearmon(Period_2)) %>%
  mutate(Period_2 = cumsum_reset(is.na(Period_2), is.na(Period_2) / 12) + na.locf(Period_2)) %>%
  mutate(Period_2 = format(Period_2, "%Y-%m"))

#Unit: milliseconds
#                expr     min      lq      mean   median       uq     max neval
# testPeriod_2_faster  6.0256  6.4595  7.147424  6.54710  6.74730 19.3891   100
#        testPeriod_2 16.3351 16.8705 18.686059 17.27665 18.46105 31.5112   100

This assumes the first year-month of a group is never NA and you only need to extrapolate.

Yes this seems quite fast even when I run against the entire file. The format() is taxing and could be dropped. — Village.Idyot, Nov 22 '22 at 10:56

Is there a faster alternative to the date extension function using the zoo package?

1 Answers1

Linked