How to regress multiple series on single series conditioned on grouping variables?

Question

I feel that my basic problem is how to regress multiple series on a single series. Although my series are not equal in time, even when I use equal time length series for stock and benchmark(I can provide data I made equal manually if needed) I get an error. I want to estimate a market model (that is, regressing a stock's returns on benchmark's return for each day, for all stocks) and to make a data frame of beta values from the regression in long format. So, for the sample provided, there will be 4 beta values(2 for ABC and 2 for XYZ) in the beta value data frame. This is a sample of two stock prices

idf <- structure(list(Firm = c("ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC",
  "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "XYZ", "XYZ", "XYZ",
  "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ",
  "XYZ"), Date = structure(c(NA, 1451642400, 1451646000, 1451649600, 1451653200,
  1451656800, 1451660400, 1451664000, 1451898000, 1451901600, 1451905200,
  1451908800, 1451912400, 1451916000, 1451919600, NA, 1451642400, 1451646000,
  1451649600, 1451653200, 1451656800, 1451660400, 1451664000, 1451898000,
  1451901600, 1451905200, 1451908800, 1451912400, 1451916000, 1451919600),
  tzone = "UTC", class = c("POSIXct", "POSIXt")), Price = c(1270.9, 1277,
  1273.25, 1273.85, 1273.75, 1272, 1265.35, 1265.35, 1253.1, 1248.1, 1242,
  1248.15, 1241.1, 1246.5, 1242.5, 225.75, 225.7, 225.5, 225.45, 228.6, 227.7,
  227.8, 227.8, 226, 225.1, 222.35, 222.25, 221.1, 221.2, 220.7), rt = c(NA,
  0.00478826614451489, -0.0029408902678254, 0.000471124032113579,
  -7.8505259892836e-05, -0.00137484063686699, -0.00524170116535849, 0,
  -0.00972828256347036, -0.00399808624683118, -0.00489941143098349,
  0.00493947152112462, -0.00566437187907365, 0.00434154082813709, 
  -0.00321414499282824, NA, -0.000221508473604359, -0.000886524880757023, 
  -0.000221754075639957, 0.0138753464936165, -0.00394477829101625, 
  0.000439077943387822, 0, -0.00793305174079517, -0.00399025135959974, 
  -0.0122920309559813, -0.000449842562691316, -0.00518778653061336, 
  0.000452181784779349, -0.00226295638548901), day = structure(c(NA, 16801,
  16801, 16801, 16801, 16801, 16801, 16801, 16804, 16804, 16804, 16804, 16804,
  16804, 16804, NA, 16801, 16801, 16801, 16801, 16801, 16801, 16801, 16804,
  16804, 16804, 16804, 16804, 16804, 16804), class = "Date")),
  .Names = c("Firm", "Date", "Price", "rt", "day"),
  row.names = c(NA, -30L), class = c("tbl_df", "data.frame"))

and a sample of benchmark index

imdf <- structure(list(Date = structure(c(NA, 1451642400, 1451646000, 1451649600,
  1451653200, 1451656800, 1451660400, 1451664000, 1451898000, 1451901600,
  1451905200, 1451908800, 1451912400, 1451916000, 1451919600, 1451923200),
  class = c("POSIXct", "POSIXt"), tzone = "UTC"), Price = c(3443.1, 3450.5,
  3453.85, 3447.9, 3456.9, 3468.45, 3472.1, 3472.1, 3484.75, 3466.45, 3417.5,
  3416.05, 3401, 3425.75, 3425.25, 3425.25), rt = c(NA, 0.0021469197059254, 
  0.000970402793278424, -0.00172420081111291, 0.00260688364525663, 
  0.00333557458000655, 0.00105178994070698, 0, 0.0036367074012027, 
  -0.00526529010184795, -0.0142217259100423, -0.000424376794422088, 
  -0.00441540679648789, 0.00725091981911596, -0.000145964093093198, 
  0), day = structure(c(NA, 16801, 16801, 16801, 16801, 16801, 16801, 16801,
  16804, 16804, 16804, 16804, 16804, 16804, 16804, 16804), class = "Date")),
  .Names = c("Date", "Price", "rt", "day"), row.names = c(NA, -16L),
  class = c("tbl_df", "tbl", "data.frame"))

Following is my dplyr based code for estimating current beta for stocks ABC and XYZ for two days by regressing their intraday return on benchmark return.

require(dplyr)
q3 <- idf %>%
  group_by(Firm, day) %>%
  summarise(Beta = PerformanceAnalytics::CAPM.beta(idf$rt, imdf$rt)) %>%
  na.omit()

But it gives the following error:

Error: The data cannot be converted into a time series. If you are trying to pass in names from a data object with one column, you should use the form 'data[rows, columns, drop = FALSE]'. Rownames should have standard date formats, such as '1985-03-15'.

For

q3 <- idf %>%
 group_by(Firm, day) %>%
 summarise(Beta = coef(summary(lm(idf$rt~imdf$rt)))[2,1]) %>%
 na.omit()

I get the following error:

Error: error in evaluating the argument 'object' in selecting a method for function 'summary': Error in model.frame.default(formula = idf$rt ~ imdf$rt, drop.unused.levels = TRUE) : variable lengths differ (found for 'imdf$rt')`

I also tried the following based on aggregate function:

beta.est= function(x) coef(summary(lm(x~imdf$rt)))[2,1]
betas = aggregate(cbind(rt) ~   Firm + day , idf, FUN = beta.est)

But it also gives same error:

Error in summary(lm(x ~ imdf$rt)) : error in evaluating the argument 'object' in selecting a method for function 'summary': Error in model.frame.default(formula = x ~ imdf$rt, drop.unused.levels = TRUE) :    variable lengths differ (found for 'imdf$rt')

Then I tried a 'for loop':

betas=list()
for(i in idf$Firm ){
 for(j in  idf$day){
  betas$day= coef(summary(lm(idf[i,4]|j~imdf$rt|j)))[2,1]}}

Again, I get this:

Error in summary(lm(idf[i, 4] | j ~ imdf$rt | j)) : error in evaluating the argument 'object' in selecting a method for function 'summary': Error in model.frame.default(formula = idf[i, 4] | j ~ imdf$rt | j, drop.unused.levels = TRUE) : variable lengths differ (found for 'imdf$rt | j')

When I use equal time length series for stock and benchmark I get the following error:

Error in summary(lm(idf[i, 4] | j ~ imdf$rt | j)) : error in evaluating the argument 'object' in selecting a method for function 'summary': Error in model.frame.default(formula = idf[i, 4] | j ~ imdf$rt | j, drop.unused.levels = TRUE) : variable lengths differ (found for 'imdf$rt | j')

Although the above codes are for estimating current beta values, it would be nice if I can also regress lag of benchmark index on the stocks. Kindly help me.

Your variables length are shorter when aggregated by `Firm` and `day` than `imd$rt`- hence the main issue here is with the `lm` function. Otherwise, simply `idf %>% group_by(Firm, day) %>% summarise_each(funs(beta.est))` would have worked. — David Arenburg, Jun 22 '16 at 18:25
A join maybe `idf %>% group_by(Firm, day) %>% left_join(imdf, by = "day") %>% summarise(Beta = coef(summary(lm(.$rt.x ~ .$rt.y)))[2,1])`? Still not sure what are you trying to do — David Arenburg, Jun 22 '16 at 18:40
Okay @DavidArenburg . step by step. returns of ABC is dependent variable and returns of Bench is independent variable on day1. returns of ABC is dependent variable and returns of Bench is independent variable on day 2. returns of XYZ is dependent variable and returns of Bench is independent variable on day1. returns of XYZ is dependent variable and returns of Bench is independent variable on day 2. Have I made myself clear? — Polar Bear, Jun 22 '16 at 18:52
@DavidArenburg I have tried what you have posted but it gives same beta values for all days: structure(list(Firm = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "XYZ" ), day = structure(c(16801, 16804, NA, 16801, 16804, NA), class = "Date"), Beta = c(0.123471787352368, 0.123471787352368, 0.123471787352368, 0.123471787352368, 0.123471787352368, 0.123471787352368)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), vars = list( Firm), .Names = c("Firm", "day", "Beta"), drop = TRUE) — Polar Bear, Jun 23 '16 at 07:23
@DavidArenburg idf %>% group_by(Firm, day) %>% summarise_each(funs(beta.est)) for equal length vairables also give similar error. — Polar Bear, Jun 23 '16 at 09:17

wici · Answer 1 · 2016-06-30T18:59:07.930

I hope you don't mind that I use some additional packages. I my opinion a combination of dplyr,tidyr,purrr and broom really simplifys working with multiple linear regressions:

To ensure we don't confuse the data column names with the benchmark column names we simply add the benchmarks name:

names(imdf) <- paste("bm1", names(imdf), sep ="_")

You can also add a lagged Date variable and/or join another benchmark

set.seed(123)
imdf$bm1_Date_lag <- imdf$bm1_Date - 3600     # shift Date by 1 hour
imdf$bm2_rt <- rnorm(nrow(imdf), sd = 0.005)  # add a dummy benchmark

First you have to group your data by Firm and day, join it with the benchmark data on Date and nest it, this creates for each group a data.frame called data containing all the other variables.

idf_grouped <- idf %>% group_by(Firm, day) %>%
     left_join(imdf, by = c("Date" = "bm1_Date")) %>% 
     nest()

Remove the groups without day values so that you are able to fit a linear model on each group and add it as a new column with map from purrr. You can do this for multiple models in one step.

idf_grouped <- idf_grouped %>% filter(!is.na(day)) %>%
     mutate(model = map(data,~lm(rt~bm1_rt, data = .)),
            model2 = map(data,~lm(rt~bm2_rt, data = .)))

To extract the coefficients of linear models I prefer tidy from broom because it returns a data.frame which is - in the context of dplyr - easier to handle. unnest() transforms your data back into a single data.frame. Because you aren't interested in the intercept we filter for bm1_rt

idf_grouped %>% 
    unnest(bm1 = model %>% map(tidy, quick = TRUE),
             bm2 = model2 %>% map(tidy, quick = TRUE),.sep = "_") %>%
    filter(bm1_term == "bm1_rt")
##    Firm        day bm1_term bm1_estimate bm2_term bm2_estimate
##   <chr>     <date>   <fctr>        <dbl>   <fctr>        <dbl>
## 1   ABC 2016-01-01   bm1_rt   0.08879514   bm2_rt   -0.2781275
## 2   ABC 2016-01-04   bm1_rt   0.25888489   bm2_rt    0.3845765
## 3   XYZ 2016-01-01   bm1_rt   0.66791986   bm2_rt   -0.2891714
## 4   XYZ 2016-01-04   bm1_rt   0.45735812   bm2_rt   -0.5014824

To compute these betas on lagged benchmarks just join by "Date" = "bm1_Date_lag" during the first step.

Edit: In order to implement an industry return you must have a mapping which industry each firm belongs to. For illustration purpose I simply added another firm "DEF" as a copy of "XYZ" that I map to the same industry as "ABC"

idf_2 <- idf %>% filter(Firm == "XYZ") %>% mutate(Firm = "DEF") %>% bind_rows(idf)
firm_map <- data.frame(Firm = c("ABC", "DEF", "XYZ"), Industry = c(1,1,2), stringsAsFactors = FALSE)

simply join this into idf_2

idf_map <- idf_2 %>% left_join(firm_map, by = "Firm")

and compute eg. each industry's average return

idf_map %>% left_join(idf_map %>% group_by(Industry, Date) %>% 
    summarise(ind_rt = mean(rt, na.rm = TRUE)), by = c("Industry", "Date"))

Now ind_rt can be used as an explanatory variable in the regression.

Thank you @wici, Please also include a lag 1 of the benchmark. some queries: Is rt.x=return on stocks & rt.y=return on benchmark?, Is it possible to first merge the benchmark into idf with benchmark stacked in idf$Firm and dates under idf$Date and then use lm or plm packages?, Is it possible to include more benchmarks (BM1, BM2) into the above process?, — Polar Bear, Jun 24 '16 at 20:56
please answer the queries so that I can accept the answer @wici . — Polar Bear, Jun 28 '16 at 09:43
@PolarBear sorry I was recently a bit occupied, just edited my answer. I'm not sure I understand your second query, for each `Firm` the benchmark is merged on matching `Date` values. You need daily evaluations so you have to split (group) your data accordingly. — wici, Jun 28 '16 at 15:00
Thank you very much. The answer looks very nice. I have one last query if you don't mind: I want to include industry returns (e.g. mining, FMCG) in the regression. I have many firms and some of them will surely belong to the same industry and some will not. This was the reason for the 2nd query in the last comment. I have found this in many articles (e.g. Roll(1988)) but I have no clue how do they do it during estimation. — Polar Bear, Jun 28 '16 at 15:39
Sorry to bother again, Kindly try to answer the last query so I can accept the answer. I am offering you the bounty for the time and effort you gave. Thank you very much. — Polar Bear, Jun 30 '16 at 20:18

How to regress multiple series on single series conditioned on grouping variables?

1 Answers1