2

I have a dataframe train (21 predictors, 1 response, 1012 observations), and I suspect that the response is a nonlinear function of the predictors. Thus, I would like to perform a multivariate polynomial regression of the response on all the predictors, and then try to understand which are the most important terms. To avoid the collinearity problems of standard multivariate polynomial regression, I'd like to use multivariate orthogonal polynomials with polym(). However, I have quite a lot of predictors, and their names do not follow a simple rule. For example, in train I have predictors named X2,X3 and X5, but not X1 and X4. The response is X14. Is there a way to write the formula in lm without having to explicitly write the name of all predictors? Writing

OrthoModel=lm(X14~polym(.,2),data=train)

returns the error

Error in polym(., 2) : object '.' not found

EDIT: the model I wanted to fit contains about 3.5 billion terms, so it's useless. It's better to fit a term with only main effects, interactions and second degree terms -> 231 terms. I wrote the formula for a standard (non-orthogonal) second degree polynomial:

`as.formula(paste(" X14 ~ (", paste0(names(Xtrain), collapse="+"), ")^2", collapse=""))` 

where Xtrain is obtained by train by deleting the response column X14. However, when I try to express the polynomial in an orthogonal basis, I get a parse text error:

    as.formula( 
         paste(" X14 ~ (", paste0(names(Xtrain), collapse="+"), ")^2", "+", 
               paste( "poly(", paste0(names(Xtrain), ", degree=2)", 
                      collapse="+"), 
               collapse="")
     )
     
 )
Nimantha
  • 6,405
  • 6
  • 28
  • 69
DeltaIV
  • 4,773
  • 12
  • 39
  • 86

1 Answers1

2

There are a couple of problems with that approach, one of which you already see but even if the dot could be expanded within polym you would still have faced an error when it came time for the 2 to be evaluated, because degree is a parameter after the "dots" in the polym argument list and it therefore must be supplied as a named parameter rather than just positionally offered.

An approach using as.formula succeeds (with the 'Orthodont' dataframe in pkg:nlme (although using 'Sex' as the dependent variable is statistically nonsense). I took out the "Subject" column from the data and also took out the "Sex" from the names passed to paste:

data(Orthodont, package="nlme")
lm(   as.formula( paste("Sex~polym(" ,
                        paste(names(Orthodont[-(3:4)]), collapse=","),",degree=2)")), 
      data=Orthodont[-3])

Call:
lm(formula = as.formula(paste("Sex~polym(", paste(names(Orthodont[-(3:4)]), 
    collapse = ","), ",degree=2)")), data = Orthodont[-3])

Coefficients:
                        (Intercept)  polym(distance, age, degree = 2)1.0  
                             1.4433                              -2.5849  
polym(distance, age, degree = 2)2.0  polym(distance, age, degree = 2)0.1  
                             0.4651                               1.3353  
polym(distance, age, degree = 2)1.1  polym(distance, age, degree = 2)0.2  
                            -7.6514      

Formula objects can be created from text input with as.formula. This is essentially an application of the last example in ?as.formula.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Looks great! I'll test it tomorrow and accept the answer, if it works. Thank you! BTW, can you explain why `collapse=","` is needed as an argument for `paste`? Also,I've sometimes seen `paste0` used in the context of building a formula, as opposed to `paste`. Any reason why I should prefer a form or the other? – DeltaIV Aug 25 '15 at 21:23
  • 1
    The comma(s) is/are used to separate the `names` given as the argument list of `polym`. It doesn't really matter whether `paste` or `paste0` are used in building formulas unless you are trying to make names that get pasted together with no spaces, say with a stem of "X" and a sequence of numbers. Then you might prefer to use `paste0`. – IRTFM Aug 25 '15 at 21:48
  • Thanks for the answer! It works, at least in theory: in practice I get an out of memory error "cannot allocate vector of size 26.0 Gb". That's reasonable - I was asking for an horrible model with roughly 3.5 billions coefficients!!!!! Nonononono. `polym` is definitely useless here. Must think of something else... – DeltaIV Aug 26 '15 at 09:27
  • I started modifying an answer (by you, what a coincidence!) to an older question of mine and I think I may be on to something. I edited my question since it doesn't fit in the comments. Can you help me getting the last bits right? – DeltaIV Aug 26 '15 at 09:42