
I have a data frame (df1) and a function that returns a coef value (calculated through the lm function). There are two ways of calculating it: the first is when I want a specific coefficient for a given ID, date and Category, and the other is calculating all possible coef values, according to subset_df1.

The code works. For the first way, the result is calculated instantly, but calculating all the coef values takes a considerable amount of time, as you can see. I used the tictoc package just to show the calculation time, which gave 633.38 sec elapsed. An important point to highlight is that df1 is not a small data frame, but for the calculation of all coef values I filter it first, which in this case gives subset_df1.

I added comments in the code so you can better understand what I'm doing. The idea is to generate coef values for all dates >= date1.

Finally, I would like to reduce, as much as reasonably possible, the processing time for calculating all the coef values.

library(dplyr)
library(tidyr)
library(lubridate)
library(tictoc)

# database
df1 <- data.frame(Id = rep(1:5, length.out = 900),
                  date1 = as.Date("2021-12-01"),
                  date2 = rep(seq(as.Date("2021-01-01"), length.out = 450, by = 1), each = 2),
                  Category = rep(c("ABC", "EFG"), length.out = 900),
                  Week = rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
                               "Saturday", "Sunday"), length.out = 900),
                  DR1 = sample(200:250, 900, replace = TRUE),
                  setNames(replicate(365, sample(0:900, 900), simplify = FALSE),
                           paste0("DRM", formatC(1:365, width = 2, format = "d", flag = "0"))))
                             
return_coef <- function(df1,idd,dmda,CategoryChosse) {
  
  # First idea: calculate the median of the values resulting from subtracting the DRM columns from DR1.
  
  subsetDRM <- df1 %>% select(starts_with("DRM"))
  
  DR1_subsetDRM <- cbind(df1, setNames(df1$DR1 - subsetDRM, paste0(names(subsetDRM), "_PV")))
  
  subset_PV <- select(DR1_subsetDRM, Id, date2, Week, Category, DR1, ends_with("PV"))
  
  result_median<-subset_PV %>%
    group_by(Id,Category,Week) %>%
    dplyr::summarize(dplyr::across(ends_with("PV"), median),.groups = 'drop')
  
  # Second idea: after obtaining the medians, I add them to the values of the DRM columns of my df1 data frame.
  
  Sum_DRM_result_median <- df1 %>%
    inner_join(result_median, by = c('Id', 'Category', 'Week')) %>%
    mutate(across(matches("^DRM\\d+$"), ~ .x + get(paste0(cur_column(), '_PV')),
                  .names = '{col}_{col}_PV')) %>%
    select(Id:Category, DRM01_DRM01_PV:last_col())
  
  Sum_DRM_result_median<-data.frame(Sum_DRM_result_median)
  
  # Third idea: here I filter a specific row of Sum_DRM_result_median, depending on what the user chooses: an Id, a date and a Category.
  
  # I use remove_values_0 because sometimes there are trailing values equal to zero, so I remove those columns (this was solved here: https://stackoverflow.com/questions/69452882/delete-column-depending-on-the-date-and-code-you-choose)
  remove_values_0 <- df1 %>%
    dplyr::filter(Id == idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
    select(starts_with("DRM")) %>%
    pivot_longer(cols = everything()) %>%
    arrange(desc(row_number())) %>%
    mutate(cs = cumsum(value)) %>%
    dplyr::filter(cs == 0) %>%
    pull(name)
  dropnames <- paste0(remove_values_0, "_", remove_values_0, "_PV")
  
  filterid_date_category <- Sum_DRM_result_median %>%
    filter(Id==idd,date2 == ymd(dmda), Category == CategoryChosse) %>%
    select(-any_of(dropnames))
  
  # Fourth idea: after selecting the corresponding row, I build the data for the coef calculation. For this, I drop some initial rows, depending on the chosen day.
  
  datas <- filterid_date_category %>%
    filter(Id == idd, date2 == ymd(dmda)) %>%
    group_by(Category) %>%
    summarize(across(starts_with("DRM"), sum), .groups = 'drop') %>%
    pivot_longer(cols = -Category, names_pattern = "DRM(.+)", values_to = "val") %>%
    mutate(name = readr::parse_number(name))
  colnames(datas)[-1] <- c("days", "numbers")
  
  datas <- datas %>% 
    group_by(Category) %>% 
    slice((ymd(dmda) - min(as.Date(df1$date1) [
      df1$Category == first(Category)])):max(days)+1) %>%
    ungroup
  
  # After building the datas dataset, I use the lm function to obtain the coef value.
  
  mod <- lm(numbers ~ I(days^2), datas)
  val <- as.numeric(coef(mod)[1])
  
  return(val)
  
}

To calculate the coef for a specific ID, date and Category in my df1 data frame, I do:

return_coef(df1,"2","2021-12-10","ABC")
[1] 209.262 # This value may vary, as the values in my df1 data frame vary

To calculate all the coef, I do:

tic()
subset_df1 <- subset(df1, date2 >= date1)

All <- subset_df1 %>%
  transmute(
    Id, date2, Category,
    coef = mapply(return_coef, list(cur_data()), Id, as.Date(date2), Category))
toc()
633.38 sec elapsed
Antonio
  • Your question would have a lot more appeal if you described what your code does in words. – Gregor Thomas Jan 13 '22 at 15:12
  • Gregor, do you think it is better for me to comment on each line of the code saying what it does, or to describe in the question, in general terms, what the code is doing? – Antonio Jan 13 '22 at 15:14
  • In general: profile the code, then make it do less (such as avoiding making copies) and use appropriate data structure(s). Many of the functions used in the code already have dedicated SO topics on their performance, those are a good place to start. To improve your post, please include a realistic target, and why the current code does not meet those expectations. – Donald Seinen Jan 13 '22 at 15:15
  • My preference would be a short summary in words of the overall goal, and then comments not for individual lines, but for blocks of code. *"Here's some code, can it be faster?"* is bad. *"Here's some code where I fit models to each group of data defined by column x and extract the coefficients into a data frame"* is better. *"Here's some code where I fit models to each group of data defined by column x and extract the coefficients into a data frame. The steps where I do y and z are particularly slow, can they be sped up?"* is much better. – Gregor Thomas Jan 13 '22 at 15:29
  • Please choose a title which accurately reflects your problem, so as to help future users with a similar issue. As it stands, it's just a generic statement. – user438383 Jan 13 '22 at 15:33
  • As Donald suggests, a bit of profiling is always a good place to start with performance issues. [Here's an article on profiling in RStudio](https://support.rstudio.com/hc/en-us/articles/218221837-Profiling-R-code-with-the-RStudio-IDE). If 90% of the execution time of your code is in the `lm()` call, then you could make the rest of the code hyperefficient and still only see a 10% reduction in run time. – Gregor Thomas Jan 13 '22 at 15:33
  • Also, your code wouldn't need so much explanation and comments if understandable names were used--it seems like `df1` is your input. What's `datas`? What are `x` and `m`? What's `mat1`? Why the `1` in `df1` and `mat1`? In other places, you have names that are probably meaningful to you, but not to us, so explaining in text or a comment would help. What do `PV` and `SPV` mean? And `idd` and `dmda`? – Gregor Thomas Jan 13 '22 at 15:37
  • Thanks @Gregor Thomas for the excellent explanations, I will adjust what you told me. I have a brief question: would it be better to use the `tictoc` function to time each command and try to optimize that, or is it easier to use `profvis`, which analyzes the time of all commands separately at once? – Antonio Jan 13 '22 at 16:02
  • I think it will be easier to use `profvis`. Try profiling on data that is at least medium size, bigger than the sample you share here with us. – Gregor Thomas Jan 13 '22 at 16:04
  • Hi @Gregor, I made adjustments to the question and code. Could you kindly give feedback on whether it's more understandable now? Thanks again! – Antonio Jan 19 '22 at 18:00
  • This is not an answer based on a deep study of your code, just general advice for computation-time issues: maybe you should use data.table instead of dplyr ... – MrSmithGoesToWashington Jan 21 '22 at 09:19
  • Thanks for the reply @MrSmithGoesToWashington! Do you think I should switch all the `dplyr` code I've written to `data.table`? – Antonio Jan 21 '22 at 15:45
  • @Antonio Yes, I think so ... see ekoam's answer. – MrSmithGoesToWashington Jan 22 '22 at 11:05

1 Answer


There are too many issues in your code. We need to work from scratch. In general, here are some major concerns:

  1. Don't do expensive operations so many times. Things like pivot_* and *_join are not cheap since they change the structure of the entire dataset. Don't use them so freely as if they come with no cost.

  2. Do not repeat yourself. I saw filter(Id == idd, Category == ...) several times in your function. The rows that are filtered out won't come back. This is just a waste of computational power and makes your code unreadable.

  3. Think carefully before you code. It seems that you want the regression results for multiple idd, date2 and Category. Then, should the function be designed to only take scalar inputs so that we can run it many times each involving several expensive data operations on a relatively large dataset, or should it be designed to take vector inputs, do fewer operations, and return them all at once? The answer to this question should be clear.
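
To make point 3 concrete, here is a minimal toy sketch of the two designs (the data frame toy and the helper slow_coef are invented purely for illustration; they are not part of the solution further down):

# toy data: three groups, each with a quadratic trend plus noise
set.seed(1)
toy <- data.frame(g = rep(c("a", "b", "c"), each = 10), x = rep(1:10, 3))
toy$y <- toy$x^2 + rnorm(nrow(toy))

# scalar interface: every call filters the whole data frame again
slow_coef <- function(df, grp) coef(lm(y ~ I(x^2), df[df$g == grp, ]))[[1L]]
sapply(unique(toy$g), slow_coef, df = toy)

# vectorised interface: group once, fit one model per group, return everything at once
library(data.table)
as.data.table(toy)[, .(intercept = coef(lm(y ~ I(x^2), .SD))[[1L]]), by = g]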

Now I will show you how I would approach this problem. The steps are

  1. Find the relevant subset for each group of idd, dmda and CategoryChosse at once. We can use one or two joins to find the corresponding subset. Since we also need to calculate the median for each Week group, we would also want to find the corresponding dates that are in the same Week group for each dmda.

  2. Pivot the data from wide to long, once and for all. Use a row id to preserve row relationships. Call the column containing those "DRMXX" labels day and the column containing the values value.

  3. Find whether trailing zeros exist for each row id. Use rev(cumsum(rev(x)) != 0) instead of a long and inefficient pipeline (see the short illustration just after this list).

  4. Compute the median-adjusted values by each group of "Id", "Category", ..., "day", and "Week". Doing things by group is natural and efficient in a long data format.

  5. Aggregate the Week group. This follows directly from your code, while we will also filter out days that are smaller than the difference between each dmda and the corresponding date1 for each group.

  6. Run lm for each group of Id, Category and dmda identified.

  7. Use data.table for greater efficiency.

  8. (Optional) Use a different median function rewritten in C++, since the one in base R (stats::median) is a bit slow (stats::median is a generic method that handles various input types, but we only need it to take numerics in this case). The median function is adapted from here.
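
As a quick illustration of the trailing-zero mask from step 3 (the vector x here is just a made-up example), everything up to and including the last non-zero value is kept and only the trailing zeros are flagged FALSE:

x <- c(0, 3, 0, 5, 0, 0, 0)
rev(cumsum(rev(x)) != 0)
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE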

The code below demonstrates these steps:

Rcpp::sourceCpp(code = '
#include <Rcpp.h>

// [[Rcpp::export]]
double mediancpp(Rcpp::NumericVector& x, const bool na_rm) {
  std::size_t m = x.size();
  if (m < 1) Rcpp::stop("zero length vector not allowed.");
  if (!na_rm) {
    for (Rcpp::NumericVector::iterator i = x.begin(); i != x.end(); ++i)
      if (Rcpp::NumericVector::is_na(*i)) return *i;
  } else {
    for (Rcpp::NumericVector::iterator i = x.begin(); i != x.begin() + m; )
      Rcpp::NumericVector::is_na(*i) ? std::iter_swap(i, x.begin() + --m) : (void)++i;
  }
  if (m < 1) return x[0];

  std::size_t n = m / 2;
  std::nth_element(x.begin(), x.begin() + n, x.begin() + m);

  return m % 2 ? x[n] : (x[n] + *std::max_element(x.begin(), x.begin() + n)) / 2.;
}
')
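
A small usage sketch of mediancpp (the vector v is made up). Because the function takes its argument by reference, the vector you pass in may be partially reordered in place, so only pass temporaries or copies you do not mind modifying:

v <- c(9, 1, NA, 5, 3)
mediancpp(v, na_rm = TRUE)
[1] 4
# note: v itself may now be partially reordered by the call above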

dt_return_intercept <- function(dt1, idd, dmda, category) {
  # type checks
  stopifnot(
    data.table::is.data.table(dt1), 
    length(idd) == length(dmda), 
    length(idd) == length(category)
  )
  dmda <- switch(
    class(dt1$date2), 
    character = as.character(dmda), Date = as.Date(dmda, "%Y-%m-%d"), 
    stop("non-comformable types between `dmda` and `dt1$date2`")
  )
  idd <- as(idd, class(dt1$Id))
  
  # find subsets
  DT <- data.table::setDT(list(Id = idd, date2 = dmda, Category = category, order = seq_along(idd)))
  DT <- dt1[
    dt1[DT, .(Id, Category, date2, Week, order), on = .NATURAL], 
    on = .(Id, Category, Week), allow.cartesian = TRUE
  ]
  DT[, c("rowid", "date1", "date2", "i.date2") := c(
    list(seq_len(.N)), lapply(.SD, as.Date, "%Y-%m-%d")
  ), .SDcols = c("date1", "date2", "i.date2")]
  
  # pivot + type conversion
  DT <- data.table::melt(DT, measure = patterns("DRM(\\d+)"), variable = "day")
  DT[, `:=`(day = as.integer(sub("^\\D+", "", day)), value = as.numeric(value))]
  
  # computations
  DT[, keep := rev(cumsum(rev(value)) != 0), by = "rowid"]
  DT[, value := value +  mediancpp(DR1 - value, TRUE), 
     by = c("Id", "Category", "i.date2", "date1", "day", "Week")]
  DT <- DT[date2 == i.date2 & keep & day > i.date2 - date1, 
           .(value = sum(value), order = order[[1L]]), 
           by = c("Id", "Category", "i.date2", "date1", "day")]
  DT[, .(out = coef(lm(value ~ I(day^2), .SD))[[1L]], order = order[[1L]]), # coef(...)[[1L]] gives you the intercept, not the coefficient of day^2. Are you sure this is what you want?
     by = c("Id", "Category", "i.date2")][order(order)]$out
}
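
For reference, a sketch of how the function can be called (assuming dt1 is the data.table version of df1, created as in the benchmark below; the returned values will differ from run to run since df1 is random):

dt1 <- data.table::setDT(data.table::copy(df1))

# a single Id, date and Category, analogous to return_coef(df1, "2", "2021-12-10", "ABC")
dt_return_intercept(dt1, "2", "2021-12-10", "ABC")

# all Id/date2/Category combinations with date2 >= date1, in one vectorised call;
# results come back in the same order as the inputs
subset_df1 <- subset(df1, date2 >= date1)
dt_return_intercept(dt1, subset_df1$Id, subset_df1$date2, subset_df1$Category)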

Benchmark

params <- (params <- unique(df1[df1$date1 <= df1$date2, c(1L, 3L, 4L)]))[sample.int(nrow(params), 20L), ]
dt1 <- data.table::setDT(data.table::copy(df1)) # nothing but a data.table version of `df1`
microbenchmark::microbenchmark(
  mapply(function(x, y, z) return_coef(df1, x, y, z), 
         params$Id, params$date2, params$Category), 
  dt_return_intercept(dt1, params$Id, params$date2, params$Category), 
  dt_return_intercept_base(dt1, params$Id, params$date2, params$Category), # use stats::median instead of mediancpp
  times = 10L, check = "equal"
)

Results are as follows. No error is thrown with check="equal". This means all three functions return the same result. This function is about 136x faster than yours with mediancpp, and about 73x faster than yours with stats::median. To avoid copy, mediancpp takes its first argument by reference. Therefore, it needs to be used with caution. This behavior fits well in this case since DR1 - value creates a temporary object that affects none of our variables.

Unit: milliseconds
                                                                                         expr        min         lq        mean      median         uq        max neval
mapply(function(x, y, z) return_coef(df1, x, y, z), params$Id, params$date2, params$Category) 11645.1729 11832.4373 11902.36716 11902.95195 11979.4154 12145.1154    10
                           dt_return_intercept(dt1, params$Id, params$date2, params$Category)    68.3173    72.4008    87.14596    75.24725    88.6007   167.2546    10
                      dt_return_intercept_base(dt1, params$Id, params$date2, params$Category)   153.9713   157.0826   163.18133   162.12175   167.2681   176.6866    10
ekoam
  • Thanks for the great explanation @ekoam – Antonio Mar 16 '22 at 20:47
  • @ekoam, I am working with my brother Antonio. Could you take a look at this question please: https://stackoverflow.com/questions/71682213/error-in-order1l-subscript-out-of-bounds-in-r I'm using your code. –  Mar 30 '22 at 21:18
  • @ekoam, sorry, but I have a brief question: I was using `dt_return_intercept(dt1, "2", "2021-12-10", "ABC")` for a specific day. How do I get the `coef` for all dates? I saw that you used `mapply`, but you combined it with `return_coef`, so I got confused. How would it be done with just your function? – Antonio Mar 30 '22 at 21:43