6

I tried to run anova on different sets of data and didn't quite know how to do it. I goolged and found this to be useful: https://stats.idre.ucla.edu/r/codefragments/looping_strings/

hsb2 <- read.csv("https://stats.idre.ucla.edu/stat/data/hsb2.csv")
names(hsb2)
varlist <- names(hsb2)[8:11]
models <- lapply(varlist, function(x) {
lm(substitute(read ~ i, list(i = as.name(x))), data = hsb2)
})

My understanding of what the above codes does is it creates a function lm() and apply it to each variable in varlist and it does linear regression on each of them.

So I thought use aov instead of lm would work for me like this:

aov(substitute(read ~ i, list(i = as.name(x))), data = hsb2)

However, I got this error:

Error in terms.default(formula, "Error", data = data) : 
no terms component nor attribute

I have no idea of where the error comes from. Please help!

KT12
  • 549
  • 11
  • 24
olala
  • 4,146
  • 9
  • 34
  • 44
  • 3
    That's a rather complex way to include a variable they suggest. I'd probably just do: `lm(as.formula(paste("read ~",x)), data = hsb2)` – thelatemail Sep 23 '14 at 05:25
  • @RichardScriven: Do you wonder if this is an example of "eternal September" (the posting of undergraduates with modest cluefulness?) – IRTFM Sep 23 '14 at 05:55
  • 1
    @BondedDust I wish I were still a undergraduate.. – olala Sep 23 '14 at 13:38

3 Answers3

6

The problem is that substitute() returns an expression, not a formula. I think @thelatemail's suggestion of

lm(as.formula(paste("read ~",x)), data = hsb2)

is a good work around. Alternatively you could evaluate the expression to get the formula with

models <- lapply(varlist, function(x) {
    aov(eval(substitute(read ~ i, list(i = as.name(x)))), data = hsb2)
})

I guess it depends on what you want to do with the list of models afterward. Doing

models <- lapply(varlist, function(x) {
    eval(bquote(aov(read ~ .(as.name(x)), data = hsb2)))
})

gives a "cleaner" call property for each of the result.

MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • You shouldn't need both `bquote` and `substitute` - pick one! e.g. `eval(bquote(aov(read ~ .(as.name(x)), data = hsb2)))` – hadley Sep 23 '14 at 17:25
  • Thanks @hadley. Don't know why I was being silly. I've updated based on your comment. – MrFlick Sep 23 '14 at 18:18
  • Neither `substitute` nor `bquote` typically return an expression. Typically they return a call. – IRTFM Sep 23 '14 at 22:03
  • @BondedDust expression is a ill-defined term in R. Expression objects (as returned by `expression()`) are rarely, if ever, needed. – hadley Sep 23 '14 at 22:47
  • There is an `is.expression` function and an 'expression' mode. That should serve as an operational definition. – IRTFM Sep 23 '14 at 22:50
  • @BondedDust All i meant by "expression" was any object that you can hold in an `expression` class object (which acts much list a list). That can be a name, call, or an a atomic type to name a few: ie `lapply(expression(a, print(), 1), class)` – MrFlick Sep 24 '14 at 02:12
5

This should do it. The varlist vector is going to be passed item by item to the function and the column will be delivered. The lm function will only see a two column dataframe and the "read" column will be the dependent variable each time. No need for fancy substitution:

models <- sapply(varlist, function(x) {
lm(read ~ .,  data = hsb2[, c("read", x) ])
}, simplify=FALSE)

> summary(models[[1]])  # The first model. Note the use of "[["

Call:
lm(formula = read ~ ., data = hsb2[, c("read", x)])

Residuals:
     Min       1Q   Median       3Q      Max 
-19.8565  -5.8976  -0.8565   5.5801  24.2703 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 18.16215    3.30716   5.492 1.21e-07 ***
write        0.64553    0.06168  10.465  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 8.248 on 198 degrees of freedom
Multiple R-squared: 0.3561, Adjusted R-squared: 0.3529 
F-statistic: 109.5 on 1 and 198 DF,  p-value: < 2.2e-16 

For all the models::

lapply(models, summary)
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Nice. Using `sapply(...,simplify=FALSE)` instead of `lapply` will also keep the name and allow you to subset like `models$write` – thelatemail Sep 23 '14 at 05:53
  • Clear to stick with `lapply()` and then name the results. – hadley Sep 23 '14 at 17:27
  • thanks for the explanation! I'm not sure which answer to accept... maybe MrFlick's since it's more for my confusion.. you've got a lot of points already :) – olala Sep 23 '14 at 19:50
  • Right. Certainly don't need any more points. MrFlick is a worthy recipeint. @hadley could use `setNames( sapply(...), varlist)` – IRTFM Sep 23 '14 at 21:54
4

akrun borrowed my answer the other night, now I'm (partially) borrowing his.

do.call puts the variables into the call output so it reads properly. Here's a general function for simple regression.

doModel <- function(col1, col2, data = hsb2, FUNC = "lm") 
{
    form <- as.formula(paste(col1, "~", col2))
    do.call(FUNC, list(form, substitute(data)))
}     

lapply(varlist, doModel, col1 = "read")
# [[1]]
#
# Call:
# lm(formula = read ~ write, data = hsb2)
#
# Coefficients:
# (Intercept)        write  
#     18.1622       0.6455  
#
#
# [[2]]
#
# Call:
# lm(formula = read ~ math, data = hsb2)
#
# Coefficients:
# (Intercept)         math  
#     14.0725       0.7248  
#
# ...
# ...
# ...

Note: As thelatemail mentions in his comment

sapply(varlist, doModel, col1 = "read", simplify = FALSE)

will keep the names in the list and also allow list$name subsetting.

Rich Scriven
  • 97,041
  • 11
  • 181
  • 245