2

I am having a data.table with returns on n dates for m securities. I would like to do a multiple linear regression in the form of lm(ReturnSec1 ~ ReturnSec2 + ReturnSec3 + ... + ReturnSecM). The problem that I am having is that there might be dates missing for some of the securities and the regression should be on aligned dates. Here is what I have come up with so far:

#The data set
set.seed(1)
dtData <- data.table(SecId = rep(c(1,2,3), each= 4), Date = as.Date(c(1,2,3,5,1,2,4,5,1,2,4,5)), Return = round(rnorm(12),2))

#My solution so far
dtDataAligned <- merge(dtData[SecId == 1,list(Date, Return)], dtData[SecId == 2, list(Date, Return)], by='Date', all=TRUE)
dtDataAligned <- merge(dtDataAligned, dtData[SecId == 3,list(Date, Return)], by='Date',  all=TRUE)
setnames(dtDataAligned, c('Date', 'Sec1', 'Sec2', 'Sec3'))
dtDataAligned[is.na(dtDataAligned)] <- 0

#This is what i want to do
fit <- lm(dtDataAligned[, Sec1] ~ dtDataAligned[, Sec2] + dtDataAligned[, Sec3])

Is there a better (more elegant, possibly faster) way of doing this without having to loop and merge the data.table to perform a regression on the values with aligned dates?

Wolfgang Wu
  • 834
  • 6
  • 16

2 Answers2

3

If the question is how to reproduce the output from the code shown in the question in a more compact fashion then try this:

library(zoo)

z <- read.zoo(dtData, split = 1, index = 2)
z0 <- na.fill(z, fill = 0)
lm(`1` ~., z0)

ADDED

Regarding the comment about elegance we could create a magrittr package pipeline out of the above like this:

library(magrittr)

dtData %>% 
    read.zoo(split = 1, index = 2) %>%
    na.fill(fill = 0) %>%
    lm(formula = `1` ~.)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • +1 for `.` in `lm`, keep forgetting that is an option. – BrodieG Apr 23 '14 at 12:50
  • Thanks for the answer. This is exactly what I intended to do. Sorry for being unclear about the loop. In my example there are only 3 securities, however in my problem I have many more, therefore requiring a loop to merge them together! – Wolfgang Wu Apr 23 '14 at 14:17
3

Here is a data.table solution using dcast.data.table, which takes data in the long format (your input) and converts it to the wide format required for the lm call.

lm(`1` ~ ., dcast.data.table(dtData, Date ~ SecId, fill=0))

Here is the output of the dcast call:

         Date     1     2     3
1: 2014-01-02 -0.63  0.33  0.58
2: 2014-01-03  0.18 -0.82 -0.31
3: 2014-01-04 -0.84  0.00  0.00
4: 2014-01-05  0.00  0.49  1.51
5: 2014-01-06  1.60  0.74  0.39

I stole the lm piece from @G.Grothendieck. Note that if you have more than three columns in your real data you will need to specify the value.var parameter for dcast.data.table.

BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • Great stuff! I didn't know about the dcast function. I just updated my data.table library and this works perfect. – Wolfgang Wu Apr 23 '14 at 14:18