Dynamically reference column name in multiple linear regression (lm())

Question

I apologize if this question was poorly worded, but after hours of searching the web I feel confident in saying this question has not been answered previously. I will do my best to describe in detail exactly what this problem entails.

Data-set summary: The data being used is financial data (Open, High, Low, Close) that was retrieved with Python code and stored in individual CSV files. Using lapply, the files were then read and stored. To keep things simple, all I am focusing on currently is daily percentage change, or (Close/shift(Close))-1. For purposes of this problem, I have removed all NAs as well as incomplete tickers from the data.
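
The daily percentage change described above, (Close/shift(Close))-1, can be sketched in base R roughly as follows. The `close` vector holds made-up prices for illustration only; it is not taken from the actual data set.

```r
# Hypothetical closing prices for one ticker (illustrative values only).
close <- c(100, 102, 99.96, 104.958)

# Daily percentage change: each close divided by the previous close, minus 1.
# close[-1] drops the first day; head(close, -1) drops the last day,
# so the two vectors line up as (today, yesterday) pairs.
pct_chg <- close[-1] / head(close, -1) - 1

pct_chg
```

The result is one element shorter than the input, since the first day has no previous close to compare against.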

I have a data frame (converted from a list) of 98 columns (the tickers), spanning 1000 rows (the days). The values within the data frame/matrix are the daily percentage changes for each ticker, on each day.

Objective: I want to know how to apply lm() to each column in turn by dynamically referencing the column name, using ALL other columns as predictors (~ .).

Sample data set:

aapl_pct_chg <- c(.02, .03, .01, -.05, -.01)
tmus_pct_chg <- c(-.01, -.02, .05, .01, -.03)
akam_pct_chg <- c(.1, -.2, .3, -.03, -.07)
intc_pct_chg <- c(.01, .03, .02, .01, .1)
de_pct_chg <- c(-.01, -.05, .05, .1, -.03)

df <- as.data.frame(cbind(aapl_pct_chg, tmus_pct_chg, akam_pct_chg, intc_pct_chg, de_pct_chg))

names(df) <- c("AAPL", "TMUS", "AKAM", "INTC", "DE")

It is simple enough to do the following:

lm_aapl <- lm(AAPL ~ ., data=df)

But I have been unable to find a way to DYNAMICALLY reference the column name without running into errors. What I mean by this is that, ideally, I could run one formula that will capture the lm() model on each column, using every other column.

There are some answered questions that have HELPED (and I apologize, I am unorganized and have tried this in 500 different ways), but none that have solved it. The closest I have come is a formula that does what I want, but it will include AAPL's values when predicting AAPL -- which leads to a good model but not what I want.

score 2 · Answer 1 · answered Dec 23 '17 at 04:03

Since you can use . in a model formula to represent all remaining variables, you can easily construct a vector of formulas as strings with paste. The usual next step is to iterate across it with lapply or similar, calling as.formula (which is not vectorized) on the string and then applying the formula. All together,

df <- data.frame(AAPL = c(0.02, 0.03, 0.01, -0.05, -0.01), 
                 TMUS = c(-0.01, -0.02, 0.05, 0.01, -0.03), 
                 AKAM = c(0.1, -0.2, 0.3, -0.03, -0.07), 
                 INTC = c(0.01, 0.03, 0.02, 0.01, 0.1), 
                 DE = c(-0.01, -0.05, 0.05, 0.1, -0.03))

models <- lapply(paste(names(df), '~ .'), 
                 function(f){ lm(as.formula(f), data = df) })

models[[1]]
#> 
#> Call:
#> lm(formula = as.formula(f), data = df)
#> 
#> Coefficients:
#> (Intercept)         TMUS         AKAM         INTC           DE  
#>     0.01941      0.52529      0.02116     -0.33372     -0.70687

Note that the stored call shows as.formula(f) rather than the formula itself, which is not very readable. If you want the actual formula spliced into the call, build the call with substitute and eval the resulting expression:

models <- lapply(paste(names(df), '~ .'), function(f){
    eval(substitute(lm(frm, data = df), 
                    list(frm = as.formula(f))))
})

models[[2]]
#> 
#> Call:
#> lm(formula = TMUS ~ ., data = df)
#> 
#> Coefficients:
#> (Intercept)         AAPL         AKAM         INTC           DE  
#>    -0.03694      1.90370     -0.04028      0.63530      1.34566
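
As a variation on the approach above, the formula can also be built without pasting strings, using `reformulate()` from base R's stats package. This is a minimal sketch, not part of the original answer; the helper name `fit_on_rest` is hypothetical. It lists the predictors explicitly with `setdiff()`, which makes it visible that the response column is excluded from the right-hand side.

```r
df <- data.frame(AAPL = c(0.02, 0.03, 0.01, -0.05, -0.01),
                 TMUS = c(-0.01, -0.02, 0.05, 0.01, -0.03),
                 AKAM = c(0.1, -0.2, 0.3, -0.03, -0.07),
                 INTC = c(0.01, 0.03, 0.02, 0.01, 0.1),
                 DE = c(-0.01, -0.05, 0.05, 0.1, -0.03))

# Hypothetical helper: regress one column on all remaining columns.
fit_on_rest <- function(target, data) {
  predictors <- setdiff(names(data), target)         # every column except the response
  frm <- reformulate(predictors, response = target)  # e.g. AAPL ~ TMUS + AKAM + INTC + DE
  lm(frm, data = data)
}

models <- lapply(names(df), fit_on_rest, data = df)
names(models) <- names(df)  # label each model by its response column

formula(models$AAPL)
```

Naming the list by ticker lets you look up a model as `models$AAPL` instead of by position.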

This worked perfectly, thank you. For future reference how would you recommend storing the fitted values, coefficients, etc. to enhance accessibility while keeping the code concise? Appreciate your help — ThatsMrLongCut, Dec 23 '17 at 04:55
You can extract them within the anonymous function if you like, but usually it's best to store the list of models so you can pick out what you need afterwards, e.g. with `lapply(models, broom::tidy)` and `lapply(models, broom::glance)` or just `lapply(models, coef)` — alistaire, Dec 23 '17 at 04:59
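
The suggestion in the comment above can be sketched with base R only, leaving out the broom calls. The sketch assumes the list of models is named by ticker, which the original answer does not do; the names `coefs` and `fits` are illustrative.

```r
df <- data.frame(AAPL = c(0.02, 0.03, 0.01, -0.05, -0.01),
                 TMUS = c(-0.01, -0.02, 0.05, 0.01, -0.03),
                 AKAM = c(0.1, -0.2, 0.3, -0.03, -0.07),
                 INTC = c(0.01, 0.03, 0.02, 0.01, 0.1),
                 DE = c(-0.01, -0.05, 0.05, 0.1, -0.03))

models <- lapply(paste(names(df), '~ .'),
                 function(f) lm(as.formula(f), data = df))
names(models) <- names(df)  # assumption: label each model by its response column

# One coefficient vector per ticker, kept as a named list.
coefs <- lapply(models, coef)

# Fitted values collected into a matrix: one row per day, one column per ticker.
fits <- sapply(models, fitted)

coefs$TMUS
dim(fits)
```

Keeping the full model objects in `models` means residuals, summaries, and predictions can still be pulled out later with the same `lapply` pattern.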

score 1 · Answer 2 · answered Dec 23 '17 at 03:34

You can build the statement dynamically as a string and use eval() and parse() to interpret it:

names(df) <- c("AAPL", "TMUS", "AKAM", "INTC", "DE")
for (n in names(df)) {
    code <- paste0("lm_", n , " <- lm(", n, " ~ ., data=df)")
    eval(parse(text=code))
}

answered Dec 23 '17 at 03:34 by Patricio Moracho

It's a really bad idea to store code as text (and thus `eval(parse(text = ...))`). Even when operating on the language it's usually better to work with an expression or other language object. – alistaire Dec 23 '17 at 03:58
@alistaire I agree, but that's what I came up with. – Patricio Moracho Dec 23 '17 at 04:03
@alistaire can you explain why this solution is bad? – Justapigeon Aug 24 '19 at 23:59
@Justapigeon [Here's some answers](https://stackoverflow.com/questions/13649979/what-specifically-are-the-dangers-of-evalparse). Briefly, storing code as strings doesn't scale—you won't get useful errors or aids like syntax highlighting or autocompletion, and so while this example isn't too complicated, this pattern can lead to really convoluted code. More generally, this is an antipattern that repurposes tools for [operating on the language](http://adv-r.had.co.nz/Computing-on-the-language.html) instead of using an approach more idiomatic to R. – alistaire Aug 26 '19 at 02:03
@alistaire I see your point. Maybe this is a special case where this is more readable than the alternative, provided you use `lm()` correctly. – Justapigeon Aug 26 '19 at 17:04
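
To illustrate the point made in the comments above, here is a minimal sketch of the same loop written against language objects instead of code stored as text. It is one possible approach, not the commenters' own code: `bquote()` splices the column symbol into the call, and the models go into a named list rather than into separate `lm_*` variables.

```r
df <- data.frame(AAPL = c(0.02, 0.03, 0.01, -0.05, -0.01),
                 TMUS = c(-0.01, -0.02, 0.05, 0.01, -0.03),
                 AKAM = c(0.1, -0.2, 0.3, -0.03, -0.07),
                 INTC = c(0.01, 0.03, 0.02, 0.01, 0.1),
                 DE = c(-0.01, -0.05, 0.05, 0.1, -0.03))

# Iterating over a named vector makes lapply() return a named list.
models <- lapply(setNames(names(df), names(df)), function(n) {
  # bquote() replaces .(as.name(n)) with the column's symbol;
  # the bare `.` on the right-hand side is left untouched for lm() to expand.
  eval(bquote(lm(.(as.name(n)) ~ ., data = df)))
})

models$TMUS$call
```

Because the call is built as a call object, a malformed expression fails when the code is parsed rather than when a string is evaluated at run time, and the stored call still shows the actual formula.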

score 1 · Answer 3 · edited Dec 28 '19 at 20:27

I was having the same problem while creating a function to loop through variables and run a multinomial regression in R. After looking at the comments and responses here, I put together a hybrid solution.

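library(nnet)  # provides multinom()

# vars is assumed to be a character vector of predictor column names,
# and ModelData a data frame that contains them along with Group.
for (var in vars) {
    form <- paste0("Group ~ ", var)
    MLR <- multinom(as.formula(form), data = ModelData)
    print(summary(MLR))
}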

answered Dec 28 '19 at 12:24 by San Emmanuel James; edited Dec 28 '19 at 20:27 by Pierre.Vriens
