1

I have a data frame "customers" build of customer id, month and total purchases that month. I'm trying to calculate a running slope for a window of 12 months using robust regression.

I have tried the following:

Coef <- function(x) {return(rlm(cbind(x)~cbind(1:length(x)))$coefficients[2])}
customer_slope = customers %>% mutate(slope = runner(x=total_purchases,k=12,f=Coef))

I get the following error:

x 'x' is singular: singular fits are not implemented in 'rlm'

If I run a single example, the function returns what I've expected:

Coef(c(4,11,7,15,5,14,8,9,14,17,14,13))

cbind(1:length(x)) 0.6888112

Alex
  • 149
  • 1
  • 1
  • 11
  • 1
    Check out the slide package available you want to use the `mutate(value=slide_dbl(total_purchases, Cord, .before=2)` – MDEWITT Aug 05 '20 at 09:28
  • Hi, slide_dbl returns the same error of singularity – Alex Aug 05 '20 at 13:51
  • @Alex try this function to debug: `Coef <- function(x) { res <- tryCatch(rlm(cbind(x) ~ cbind(1:length(x)))$coefficients[2], error = function(e) e); if (is(res, "error")) browser();return(res)}` – GoGonzo Aug 06 '20 at 15:44

3 Answers3

1

So I ran into similar problems and finally came to the below solution using slider. This provides a 3 days rolling estimate (of course you can change as you see fit). This doesn't quite get to your answer (which you could probably get with loops), but most of the way there.

library(MASS)
library(dplyr)
library(slider)

dat <- tibble::tibble(customers = c(4,11,7,15,5,14,8,9,14,17,14,13)) %>% 
  mutate(t = 1:n() %>% as.numeric())

dat %>% 
  mutate(results = slide_dbl(.x = .,
                             .f = ~rlm(customers ~ t, k = 12, data = .x)$coefficients[2],
                             .before = 2,
                             .complete = T))
MDEWITT
  • 2,338
  • 2
  • 12
  • 23
0

It look like that's the way to go, thanks! It seems like what caused the singularity was that I didn't change the default .complete from F to T. So, combined with your suggestion, this is how I made it work (took about two hours for 3M rows I did have however more complex group_by involved which is not shown below)

slope_rlm <- function(x) {
  x=as.numeric(x)
  prep = tibble(data=x)%>%mutate(t=1:n()%>%as.numeric())
  return(rlm(data~t,data=prep)$coefficients[2])
}


customers_rlm = customers %>% 
  mutate(cust_rlm_12=slide_dbl(total_purchases,slope_rlm,.before=11,.complete=T))
Alex
  • 149
  • 1
  • 1
  • 11
0

Consider data with two customers with data from 1000 days span. total_purchases are cumulated by customer, and each purchase size is ~pois(5).

set.seed(1)
customers <- data.frame(
  id = factor(rep(1:2, length.out = 100)),
  date = seq(Sys.Date(), Sys.Date() + 1000, length.out = 100)
) %>%
  group_by(id) %>%
  mutate(
    total_purchases = cumsum(rpois(n(), lambda = 5)) 
  )

When using calculating regression in rolling window make sure that you handle errors which comming from insufficient degrees of freedom, singularity etc. - that is why I've put tryCatch around rlm call - if there is any error, function returns NA for failing window. Data below is grouped by id which means that model is calculated per customer. Yearly rolling regression should converge to the slope = 5 (+/- random error).


  customers %>%
    group_by(id) %>%
    mutate(
      slope = runner(
        x = .,
        f = function(x) {
          tryCatch(
            rlm(x$total_purchases ~ seq_len(nrow(x)))$coefficients[2],
            error = function(e) NA
          )
        },
        idx = "date",
        k = "year"
      )
    )

Plotting slope in time for customers

ggplot(customers, aes(x = date, y = slope, color = id, group = id)) +
  geom_line() +
  geom_hline(yintercept = 5, color = "red")

enter image description here

GoGonzo
  • 2,637
  • 1
  • 18
  • 25