0

The code posted at the bottom does a clean job of extending dates expressed both (a) as elapsed number of months per Period_1 in the code and (b) in YYYY-MM format defined as Period_2. Item (b) uses the zoo package as.yearmon() function for extending the YYYY-MM date.

However, when I run this code against the 2m+ row data file this is intended for, running Period_1 (item (a) above) is very fast (only 3.3 seconds) but running the code for Period_2 (item (b) above) is very slow (taking 1.74 minutes total). Is there a fast alternative to the date extensions for Period_2, for example using data.table package or just dplyr?

Code:

start_time <- Sys.time()

library(dplyr)
library(tidyr)
library(zoo)

testDF <-
  data.frame(
    ID = as.numeric(c(rep(1,5),rep(50,3),rep(60,3))),
    Period_1 = as.numeric(c(1:5,1:3,1:3)),
    Period_2 = c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-10","2012-11","2012-12"),
    Bal = as.numeric(c(rep(10,5),21:23,36:34)),
    State = c("XX","AA","BB","CC","XX","AA","BB","CC","SS","XX","AA")
  )

testPeriod_1 <-
  testDF %>%
  tidyr::complete(ID, nesting(Period_1)) %>%
  tidyr::fill(Bal, State, .direction = "down")

testPeriod_2 <- testPeriod_1 %>%
  group_by(ID) %>%
  mutate(Period_2 = as.yearmon(first(Period_2)) + seq(0, by=1/12, length=n())) %>%
  mutate(Period_2 = format(Period_2, "%Y-%m")) %>%
  ungroup

end_time <- Sys.time()
end_time - start_time
Village.Idyot
  • 1,359
  • 2
  • 8
  • Already tried all the alternatives given in your [earlier question](https://stackoverflow.com/q/74515720/6574038)? You can use `microbenchmark::microbenchmark()`. – jay.sf Nov 22 '22 at 06:59
  • Hi jay.sf, I'm trying those alternatives now (adjusting for turn of year) and will report back – Village.Idyot Nov 22 '22 at 07:31
  • 1
    My original answer that this comes from also discussed not formatting. That is an expensive operation that likely takes half the time. Just omit the `format` line. – G. Grothendieck Nov 22 '22 at 09:59
  • Yes, G. Grothendieck that helps. Your solution posted at stackoverflow.com/questions/74515720/… is indeed fast. Eliminating format helped. – Village.Idyot Nov 22 '22 at 10:37

1 Answers1

1

Avoid the group-by:

library(hutilscpp)

testPeriod_1 %>%
  mutate(Period_2 = as.yearmon(Period_2)) %>%
  mutate(Period_2 = cumsum_reset(is.na(Period_2), is.na(Period_2) / 12) + na.locf(Period_2)) %>%
  mutate(Period_2 = format(Period_2, "%Y-%m"))

#Unit: milliseconds
#                expr     min      lq      mean   median       uq     max neval
# testPeriod_2_faster  6.0256  6.4595  7.147424  6.54710  6.74730 19.3891   100
#        testPeriod_2 16.3351 16.8705 18.686059 17.27665 18.46105 31.5112   100

This assumes the first year-month of a group is never NA and you only need to extrapolate.

Roland
  • 127,288
  • 10
  • 191
  • 288