2

I have a huge dataset and would like to perform a rolling linear regression over a window of 60. However, I want that only the 60 previous values are considered for the linear regression.

My Dataframe DF consists of following Columns:

Date          Company   Y     X1   X2
01.01.2015    Mill     0.13   -1    -3
01.02.2015    Mill     0.16   1    5 
01.03.2015    Mill     0.83   3    4
01.04.2015    Mill     -0.83  23   4
01.01.1988    Hall    0.23    1    3
01.02.1988    Hall    0.24    23   2
01.03.1988    Hall    0.78    19   -9
01.04.1988    Hall    0.73    4    12
01.05.1988    Hall    0.72    5    12
01.11.2008    Jopo    0.12    0.9  32
01.12.2008    Jopo    0.13    10   32
01.01.2009    Jopo    0.32    0.2  10
01.02.2009    Jopo    0.32    2    -1

I have several thousand companies and data for several months for each company. The regression has to be done for every month of a company, with the rolling window of 60 previous months of this specific company.

In the given example, assuming only a rolling window of 3, I want for company Mill a regression on 01.04.2015 with the data from 01.01-01.03-2015. For company Hall I want regressions on 01.04 and 01.05.1988, and for Jopo I want a regression on 01.02.2009.

Ideally, the results will be pasted together with Company and Date in a new data frame, as I have to keep working with this data and have to analyse it more.

Following code should do the trick for the rolling regression, however it does not use the previous 60 dates, but 59 and includes the current date too:

library(zoo)

rolled <- function(df) {                                    
    rollapply(df, width = 60,
        FUN = function(z) coef(lm(Y ~ X1+X2, data = as.data.frame(z))),
        by.column = FALSE, align = "right"
)
}    

Following code does the regression dependent on the company name, as I want to make regressions for each individual company, independend from the other companies.

Test <- do.call("rbind", by(DF[c("Y", "X1", "X2")], DF[c( "Name")], rolled))

How do I incorporate, that only the 60 previous values are used for the regression? And maybe someone knows how to show also "Company" and "Date" in the results? Thanks for your help!

Henky
  • 69
  • 7
  • What do you mean by the "current observation"? Do you mean the observation that would be for March 2017? Also, can you provide a couple of rows of your data to see what its structure is like? (you can use fake names for the companies). – Phil Mar 03 '17 at 17:37
  • Thanks for the reply. I edited the question and hope it is clearer now. – Henky Mar 03 '17 at 18:06

1 Answers1

1

Assume DF is as given reproducibly in the Note at the end. Use by to split DF into company rows and apply the anonymous function using rollapplyr. Note that rollapplyr can take for the width a list argument with the offsets of the positions to use. For example, list(-seq(3)) means use the 3 prior rows (as suggested in the question) but not the current row (which would have position 0).

library(zoo)

# w <- 60    
w <- 3
Coef <- function(x) coef(lm(as.data.frame(x)))
do.call("rbind", by(DF, DF$Company, function(x) 
    cbind(x, rollapplyr(x[3:5], list(-seq(w)), Coef, fill = NA, by.column = FALSE))))

giving:

              Date Company     Y   X1 X2 (Intercept)         X1         X2
Hall.5  01.01.1988    Hall  0.23  1.0  3          NA         NA         NA
Hall.6  01.02.1988    Hall  0.24 23.0  2          NA         NA         NA
Hall.7  01.03.1988    Hall  0.78 19.0 -9          NA         NA         NA
Hall.8  01.04.1988    Hall  0.73  4.0 12     0.37711 -0.0017480 -0.0484553
Hall.9  01.05.1988    Hall  0.72  5.0 12     1.30333 -0.0433333 -0.0333333
Jopo.10 01.11.2008    Jopo  0.12  0.9 32          NA         NA         NA
Jopo.11 01.12.2008    Jopo  0.13 10.0 32          NA         NA         NA
Jopo.12 01.01.2009    Jopo  0.32  0.2 10          NA         NA         NA
Jopo.13 01.02.2009    Jopo  0.32  2.0 -1     0.41104  0.0010989 -0.0091259
Mill.1  01.01.2015    Mill  0.13 -1.0 -3          NA         NA         NA
Mill.2  01.02.2015    Mill  0.16  1.0  5          NA         NA         NA
Mill.3  01.03.2015    Mill  0.83  3.0  4          NA         NA         NA
Mill.4  01.04.2015    Mill -0.83 23.0  4     0.21611  0.2994444 -0.0711111

You could also try this:

library(broom)
fun <- function(x) unlist(tidy(lm(as.data.frame(x)))[, -1]) 
do.call("rbind", by(DF, DF$Company, function(x) 
 cbind(x, rollapplyr(x[3:5], list(-(seq(w))), fun, fill = NA, by.column = FALSE))))

which gives:

              Date Company     Y   X1 X2 estimate1    estimate2    estimate3
Hall.5  01.01.1988    Hall  0.23  1.0  3        NA           NA           NA
Hall.6  01.02.1988    Hall  0.24 23.0  2        NA           NA           NA
Hall.7  01.03.1988    Hall  0.78 19.0 -9        NA           NA           NA
Hall.8  01.04.1988    Hall  0.73  4.0 12 0.3771138 -0.001747967 -0.048455285
Hall.9  01.05.1988    Hall  0.72  5.0 12 1.3033333 -0.043333333 -0.033333333
Jopo.10 01.11.2008    Jopo  0.12  0.9 32        NA           NA           NA
Jopo.11 01.12.2008    Jopo  0.13 10.0 32        NA           NA           NA
Jopo.12 01.01.2009    Jopo  0.32  0.2 10        NA           NA           NA
Jopo.13 01.02.2009    Jopo  0.32  2.0 -1 0.4110390  0.001098901 -0.009125874
Mill.1  01.01.2015    Mill  0.13 -1.0 -3        NA           NA           NA
Mill.2  01.02.2015    Mill  0.16  1.0  5        NA           NA           NA
Mill.3  01.03.2015    Mill  0.83  3.0  4        NA           NA           NA
Mill.4  01.04.2015    Mill -0.83 23.0  4 0.2161111  0.299444444 -0.071111111
        std.error1 std.error2 std.error3 statistic1 statistic2 statistic3
Hall.5          NA         NA         NA         NA         NA         NA
Hall.6          NA         NA         NA         NA         NA         NA
Hall.7          NA         NA         NA         NA         NA         NA
Hall.8         NaN        NaN        NaN        NaN        NaN        NaN
Hall.9         NaN        NaN        NaN        NaN        NaN        NaN
Jopo.10         NA         NA         NA         NA         NA         NA
Jopo.11         NA         NA         NA         NA         NA         NA
Jopo.12         NA         NA         NA         NA         NA         NA
Jopo.13        NaN        NaN        NaN        NaN        NaN        NaN
Mill.1          NA         NA         NA         NA         NA         NA
Mill.2          NA         NA         NA         NA         NA         NA
Mill.3          NA         NA         NA         NA         NA         NA
Mill.4         NaN        NaN        NaN        NaN        NaN        NaN
        p.value1 p.value2 p.value3
Hall.5        NA       NA       NA
Hall.6        NA       NA       NA
Hall.7        NA       NA       NA
Hall.8       NaN      NaN      NaN
Hall.9       NaN      NaN      NaN
Jopo.10       NA       NA       NA
Jopo.11       NA       NA       NA
Jopo.12       NA       NA       NA
Jopo.13      NaN      NaN      NaN
Mill.1        NA       NA       NA
Mill.2        NA       NA       NA
Mill.3        NA       NA       NA
Mill.4       NaN      NaN      NaN
> 

Alternative

Another possibility is to use a width of w+1 and then remove the last component.

# w <- 60    
w <- 3 
Coef1 <- function(x) coef(lm(as.data.frame(head(x, -1))))
do.call("rbind", by(DF, DF$Company, function(x) 
    cbind(x, rollapplyr(x[3:5], w+1, Coef1, fill = NA, by.column = FALSE))))

Fewer than w+1 rows in a company

If there are companies with fewer than w+1 rows then try this. It uses the partial=TRUE argument of rollapplyr to compute lm with fewer rows and modifies Coef accordingly so that it will continue to work:

# w <- 60    
w <- 3
Coef <- function(x) coef(lm(as.data.frame(matrix(x, c(nrow(x), 1)))))
do.call("rbind", by(DF, DF$Company, function(x) cbind(x, 
  rollapplyr(x[3:5], list(-seq(w)), Coef, partial = TRUE, by.column = FALSE))))

Note: Input DF is:

Lines <- "Date          Company   Y     X1   X2
01.01.2015    Mill     0.13   -1    -3
01.02.2015    Mill     0.16   1    5 
01.03.2015    Mill     0.83   3    4
01.04.2015    Mill     -0.83  23   4
01.01.1988    Hall    0.23    1    3
01.02.1988    Hall    0.24    23   2
01.03.1988    Hall    0.78    19   -9
01.04.1988    Hall    0.73    4    12
01.05.1988    Hall    0.72    5    12
01.11.2008    Jopo    0.12    0.9  32
01.12.2008    Jopo    0.13    10   32
01.01.2009    Jopo    0.32    0.2  10
01.02.2009    Jopo    0.32    2    -1"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thank you so much! It works perfectly. Little hint for others with the same problem, you have to remove all companies with less than (in my case) 61 datapoints first, otherwise R shows an error. – Henky Mar 04 '17 at 18:25
  • I have an additional question: Do you know, how to get the summary results for every regression? Ideally also pasted in new columns in the data frame. Thank you! – Henky Mar 04 '17 at 19:40
  • Regarding the case where there are fewer than w+1 rows for a company see code added at the end. – G. Grothendieck Mar 04 '17 at 20:18
  • Thanks again, I really appreciate your help!! – Henky Mar 04 '17 at 20:22
  • Sorry, another question: Is there a way to get the t-statistic (significance) for every factor itself in the regression, not for all the factors together (which glance does if i see that right)? In the end i want the beta for every factor and its t-statistic to say something about the significance of every factor for every regression. – Henky Mar 05 '17 at 08:32
  • Thank you soooo much, I'd be lost without you right now! – Henky Mar 05 '17 at 14:51
  • Hi, i have an additional question to your comment/answer with unlist(tidy(lm(...))[, -1]) to get the t-statistics for every factor (from 5th MArch). I have run this and your original code, however I get different values for the coefficients. In your original code/answer (Coef function) I get values for X1, X2, which are different from the corresponding estimates in the code frome 5th March. But shouldn't they be the same as the same regression is run on them? – Henky Mar 17 '17 at 09:45
  • I have transferred it from the comments to the answer writing it out in full. (I have shown it for the first solution so companies with too few rows should be eliminated first.) – G. Grothendieck Mar 17 '17 at 13:43
  • Thank you again!! I found a little silly mistake in my code, that's why I had different estimates! So basically your code works perfect!!!! – Henky Mar 17 '17 at 20:26
  • Sorry to bother again. Different data set, same structure as previous dataframe. The Coef and Glance functions work, the third (tidy) doesn't. I get following warning four times: 1: In rbind(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, c(0.371905076087056, : number of columns of result is not a multiple of vector length (arg 73) (rolling window is now 12, dependent variable + 4 independent variables). What could be the reason? – Henky Mar 18 '17 at 16:36
  • You will need to provide a reproduciuble example. – G. Grothendieck Mar 18 '17 at 23:42