I have a data frame in which I want to predict each variable from all the others, so I construct a loop like this one:

df = iris
df$Species <- NULL

mods = list()
for (i in 1:ncol(df)) {
  mods[[i]] <- lm(df[, i] ~ ., df)
}

But, to my surprise, each variable appears as its own predictor. Even if I do:

mods = list()
for (i in 1:ncol(df)) {
  mods[[i]] = lm(df[, i] ~ . - df[, i], df)
}

The same happens.

I know I can create the correct formula expression on the side with the proper names and so on, but I feel like this shouldn't be the desired behaviour for lm.

The question is: Am I missing something? Is there a reason why this function has such an uncomfortable behaviour? In case the answer to the previous questions is "no", shouldn't it be improved?

camille
Eudald
  • Is there a question here? – Kent Johnson Jan 12 '20 at 15:50
  • I thought it was implicit, but I have edited to make it clearer... – Eudald Jan 12 '20 at 15:55
  • Yes, thanks, I know I can do that, but this is very awkward and I don't understand why the behaviour differs between giving an explicit name and accessing via column index. – Eudald Jan 12 '20 at 16:10
  • df[, i] is just a vector, with no index or name attached, so it is probably not easy to check it for equivalence with the terms on the rhs of the formula without comparing values explicitly. – user20650 Jan 12 '20 at 16:32

2 Answers

2

This seems expected to me and very much in line with how R operates. You are passing df into the data argument, but then referencing a different df in your formula (it is the same one, but a different object reference at this point).

In your first example, your y variable is not taken from data; it comes from that other df. Since the response is not a named column of data, the . expands to all the columns of data.

In your second example, you are saying to include all variables from data but exclude a column from some other data frame df. That term does not match any of the column names that . expands to, so nothing is removed and you are still left with all the columns from data.
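As a small sketch of the two cases above (the object names `labels_all`, `labels_minus` and `labels_named` are only for illustration), you can ask `terms()` which predictors remain once the `.` has been expanded against `data`:

```r
df <- iris
df$Species <- NULL
i <- 1

# The response df[, i] is an expression, not a column name of `data`,
# so it does not remove anything when `.` is expanded.
labels_all <- attr(terms(df[, i] ~ ., data = df), "term.labels")

# Subtracting the term `df[, i]` changes nothing either, because no
# term with that label exists on the right-hand side.
labels_minus <- attr(terms(df[, i] ~ . - df[, i], data = df), "term.labels")

# Naming the response column does drop it from the expansion of `.`.
labels_named <- attr(terms(Sepal.Length ~ ., data = df), "term.labels")

print(labels_all)
print(labels_minus)
print(labels_named)
```

Only the last call should leave Sepal.Length out of the predictors.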

I think this is what you are expecting:

mods = list()
for (i in 1:ncol(df)) {
  mods[[i]] = lm(df[, i] ~ ., df[, -i])
}
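
A quick way to check that loop, sketched under the assumption that df is iris without Species as in the question (`coef_names` is an illustrative name, and `drop = FALSE` is an extra safeguard that is not in the loop above):

```r
df <- iris
df$Species <- NULL

mods <- list()
for (i in seq_along(df)) {
  # drop = FALSE keeps a data frame even if only one column were left
  mods[[i]] <- lm(df[, i] ~ ., data = df[, -i, drop = FALSE])
}

# Each fit should have an intercept plus the three remaining columns,
# and the response column should not appear among its own predictors.
coef_names <- lapply(mods, function(m) names(coef(m)))
print(coef_names[[1]])
```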
  • Ok, makes sense; shouldn't I get an error then when subtracting a column that wasn't there in the first place? – Eudald Jan 12 '20 at 16:41
  • You are not subtracting a column that wasn't there in the first place... despite pulling the same data from the same 'file', it is looking at two different instances of the data frame, two separate objects. – sconfluentus Jan 12 '20 at 16:44
  • Yes, you can mix objects in `lm` as long as they have the same number of rows. –  Jan 12 '20 at 16:46
2

The . excludes variables by name, but the code in the question does not use any names. Build the formula from the column names instead:

df = iris
df$Species <- NULL

LM <- function(nm) {
  fo <- paste(nm, "~.")
  do.call("lm", list(fo, quote(df)))
}
Map(LM, names(df))

giving this 4 element list (only first shown):

$Sepal.Length

Call:
lm(formula = "Sepal.Length ~.", data = df)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      1.8560        0.6508        0.7091       -0.5565  

## ..snip...
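
If you prefer to spell out the predictors rather than rely on `.`, a possible variant of the same idea uses `reformulate()` from the stats package. This is only a sketch; `fit_one` is a hypothetical helper name, not part of any package:

```r
df <- iris
df$Species <- NULL

# fit_one is a hypothetical helper: it regresses column `nm` on all
# the other columns of df, with the formula built from their names.
fit_one <- function(nm) {
  fo <- reformulate(setdiff(names(df), nm), response = nm)
  lm(fo, data = df)
}

# simplify = FALSE returns a list named after the columns of df
mods <- sapply(names(df), fit_one, simplify = FALSE)
print(formula(mods[["Sepal.Length"]]))
```

The fitted models are the same as with `~ .`, but each stored formula lists its predictors explicitly.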
G. Grothendieck