There are many redundant, and sometimes conflicting, ways of specifying formulae in R. Is there a more comprehensive yet concise reference than ?formula for mapping a conceptual model to R syntax?

I am interested in a broad overview that covers the syntax used to specify formulas in non-linear and hierarchical models such as glm, lmer, gam and earth: (/) for nesting, random and fixed effects in mixed models, s and te for splines, and other constructs found in popular contributed packages.

Abe

1 Answer

R comes with several manuals, which are accessible from vanilla R's "Help" menu at the top right when running R and are also in several places on-line.

Chapter 11 of "An Introduction to R" has a couple of pages on formulas, for example.

I don't know that it constitutes a "comprehensive" resource but it covers much* of what you need to know about how formulas work.

* Indeed, pretty much all of what perhaps 95% of users will ever use

The canonical reference to formulas in the S language might be

Chambers J.M., and Hastie T.J., eds. (1992), Statistical Models in S. Chapman & Hall, London.

though the origin of the approach comes from

Wilkinson G.N., and Rogers C.E. (1973). "Symbolic Description of Factorial Models for Analysis of Variance." Applied Statistics, 22, 392–399

A number of recent books related to R discuss formulas but I don't know that I'd call any of them comprehensive.

There are also numerous on-line resources (for example here) often with a good deal of very useful information.

That said, once you get comfortable with using formulas in R and so have a context into which more knowledge can be placed, the help page contains a surprising amount of information (along with other pages it links to). It is a bit terse and cryptic, but once you have the broader base of knowledge of R's particular way of working, it can be quite useful.
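
One way to build that context, as a minimal sketch in base R: ask `terms()` how it expands a formula. The helper name `expand` and the variable names `y`, `a`, `b`, `c`, `x` are made up for illustration; no data or model fitting is involved.

```r
# Return the term labels R derives from a formula.
# terms() works on a bare formula as long as it does not use `.`.
expand <- function(f) attr(terms(f), "term.labels")

expand(y ~ a * b)          # crossing: main effects plus interaction
expand(y ~ a / b)          # nesting: a, then b within a
expand(y ~ (a + b + c)^2)  # main effects plus all two-way interactions
expand(y ~ a * b - a:b)    # `-` removes a term from the expansion
expand(y ~ log(x))         # a function call is kept as a single term
expand(y ~ I(a + b))       # I() protects arithmetic: one term, not two
```

Comparing each result against the operator descriptions in ?formula tends to make the terse wording there easier to follow. Package-specific syntax (random-effects terms in lmer, `s` and `te` in gam) is interpreted by those packages and is not covered by this sketch.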

Specific questions about R formulas are likely to be on topic at either StackOverflow or CrossValidated, depending on their content. There are already some quite advanced questions on formulas to be found (a search like [r] formula might be fruitful), and it would be handy to have more of them to help users struggling with these issues. If you have specific questions, I'd encourage you to ask.

As for 'redundant' and 'conflicting', I suppose you mean things like the fact that there is more than one way to specify a no-intercept model: y ~ . -1 and y ~ . +0 both work, for example, and each reads more naturally in a slightly different context.

In addition, there's the common bugbear of having to isolate quadratic and higher order terms from the formula interface (to use I(x^2) as a predictor so it's passed through the formula interface unharmed and survives far enough to be interpreted as an algebraic expression). Again, once you get a picture of what's going on 'behind the scenes' that seems much less of a nuisance.

Specific examples of the things I just mentioned:

lm(dist ~ . -1, data=cars) # "remove-intercept-term" form of no-intercept
lm(dist ~ . +0, data=cars) # "make-intercept-zero" form of no-intercept
lm(dist ~ speed + speed^2, data=cars) # doesn't do what we want here: ^ is a formula operator, so speed^2 reduces to speed
lm(dist ~ speed + I(speed^2), data=cars) # gets us a quadratic term
lm(dist ~ poly(speed,2), data=cars) # avoid potential multicollinearity
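
The comments on those calls can be checked without reading any model output. This is a small sketch using the built-in `cars` data; the object names (`f_minus`, `f_zero`, and so on) are only for illustration.

```r
# cars ships with R and has two columns: speed and dist.
f_minus <- lm(dist ~ . - 1, data = cars)
f_zero  <- lm(dist ~ . + 0, data = cars)

# Both no-intercept spellings give the same single coefficient.
same_no_intercept <- isTRUE(all.equal(coef(f_minus), coef(f_zero)))

# Inside a formula, ^ means crossing, so speed^2 collapses to speed.
collapsed_terms <- attr(terms(dist ~ speed + speed^2), "term.labels")

# Wrapping in I() keeps the arithmetic meaning and adds a real quadratic term.
quad_names <- names(coef(lm(dist ~ speed + I(speed^2), data = cars)))

same_no_intercept
collapsed_terms
quad_names
```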

I agree that the formula interface could at least use a little further guidance and better examples in the ?formula help.

Glen_b
  • Thank you very much for this helpful answer. I am interested in a broader overview, including the specification of nested variables and fixed vs. random effects. And aren't `x + I(x^2)` and `poly(x, 2)` equivalent? Your answer suggests otherwise. Other aspects of interest include specifying spline functions in functions such as `gam` (e.g. with `s` and `te`). – Abe May 01 '13 at 17:22
  • In reference to the second-to-last paragraph (about the bugbear): the use of `I` is not limited to specifying polynomial terms. It is also required for other variable transformations (including additive, multiplicative, log, exponential). – Abe May 01 '13 at 17:33
  • That's correct for additive and multiplicative, because they have meaning to the formula interface, but `lm(dist~log(speed),data=cars)` works as it should. – Glen_b May 01 '13 at 22:35
  • "*Your answer suggests otherwise.*" - well, (i) I really don't think it does, since the `#`-comment explains exactly why that one is there, and that comment isn't about a *different model*, and (ii) they're not the same coefficients (since `poly` uses an orthogonal transformation of (1,x,x^2)), and when multicollinearity is a serious problem not necessarily even an identical fit. I can delete that line if you'd prefer, but I think when people are considering fitting polynomials they should be thinking about using it, so since I mentioned polynomials I felt it was important to point to `poly`. – Glen_b May 01 '13 at 22:42
  • People should be severely counseled against using `X+I(X^2)`. They should instead be offered `poly(X,2)`. Failing to do so deprives them of a safe statistical passage into polynomial inference. – IRTFM May 17 '13 at 01:12
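
The `poly` versus `x + I(x^2)` point discussed in these comments can be checked directly. The following is a rough sketch on the built-in `cars` data, which is well behaved, so the two fits agree here; the object names are illustrative only.

```r
raw_fit  <- lm(dist ~ speed + I(speed^2), data = cars)             # raw powers
orth_fit <- lm(dist ~ poly(speed, 2), data = cars)                 # orthogonal basis
raw_poly <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)     # raw powers via poly()

# Same fitted values: both span the same quadratic model space.
same_fit <- isTRUE(all.equal(fitted(raw_fit), fitted(orth_fit)))

# Different coefficients: the orthogonal basis is a transformation of (1, x, x^2).
same_coef <- isTRUE(all.equal(unname(coef(raw_fit)), unname(coef(orth_fit))))

# poly(..., raw = TRUE) reproduces the I(speed^2) coefficients.
raw_matches <- isTRUE(all.equal(unname(coef(raw_fit)), unname(coef(raw_poly))))

# Raw columns are strongly correlated; the orthogonal columns are not.
cor_raw  <- cor(cars$speed, cars$speed^2)
P        <- poly(cars$speed, 2)
cor_orth <- cor(P[, 1], P[, 2])
```

On data where the raw columns are nearly collinear, the numerical results of the two parameterisations need not agree this closely, which is the case the comments above warn about.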