I have a dataset with 70 variables, and I want to try polynomial regression on it. If there were only three or four columns, I could just hand-code something like this:

 model <- lm(y ~ poly(var1, 3) + poly(var2, 3) + poly(var4, 4))

How would we go about this with 70 variables? Do we have to type in the names of all the variables manually, or is there an easier method?

smci
  • Do you have any prior knowledge (e.g., from the business domain) about the degree of the polynomial to fit for each variable? In general, polynomial regression tends to overfit and generalizes poorly. – Sandipan Dey Jan 04 '17 at 18:08
  • I know what each variable stands for, but I have no clue about the appropriate polynomial degree for any of them. A simple linear model gives very poor R-squared values (around 0.02), and I want to know how polynomial regression is modeled in general. – tired and bored dev Jan 04 '17 at 18:17
  • 4
    Your first question "How would we go about this, if we have 70 variables?" could be considered a programming question if it means how would we produce this in an automated fashion across many variables. Your second question is off topic on SO and would find a better home on [CV](http://stats.stackexchange.com/). If, as your comment implies, your main question relates to statistical modeling, I would delete the question here, and post a question on CV emphasizing this point. – lmo Jan 04 '17 at 18:23
  • Deleted the second part of the question. – tired and bored dev Jan 04 '17 at 18:31
  • Thanks, adibender also suggested the same thing. – tired and bored dev Jan 04 '17 at 19:05

1 Answer

You could build the formula with paste, if all the variables are named systematically (poly defaults to degree 1):

form <- as.formula(paste("y~", paste0("poly(var", 1:10, ")", collapse="+")))

or, for polynomials of degree 3:

form <- as.formula(paste("y~", paste0("poly(var", 1:10, ", degree=3)", collapse="+")))

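Also, if your dataset df contains only the dependent variable y and the covariates of interest (with non-systematic names), you can try

# Match the response by its exact name; grep("y", ...) would also hit any column whose name merely contains "y"
ind.y <- which(colnames(df) == "y")
# Index the name vector directly so this still works when only one covariate remains
form <- as.formula(paste("y~", paste0("poly(", colnames(df)[-ind.y], ", degree=3)", collapse="+")))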
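As a rough end-to-end sketch of the approach above, the following builds the formula from the column names and passes it to lm. The data are simulated and the column names (age, income, yield) are made up for illustration; reformulate from base R is used here as an alternative to pasting the whole formula string by hand.

```r
# Simulated data; the column names are hypothetical and deliberately non-systematic.
set.seed(1)
n <- 200
df <- data.frame(age = rnorm(n), income = rnorm(n), yield = rnorm(n))
df$y <- 1 + df$age^2 - 0.5 * df$income^3 + rnorm(n, sd = 0.1)

# Every column except the response becomes a predictor.
# "yield" contains the letter y, so the response is matched by exact name.
predictors <- setdiff(colnames(df), "y")

# One poly() term of degree 3 per predictor.
poly_terms <- paste0("poly(", predictors, ", degree = 3)")
form <- reformulate(poly_terms, response = "y")

fit <- lm(form, data = df)
summary(fit)$r.squared
```

With 70 predictors of degree 3 this yields 210 coefficients plus the intercept, so the number of rows in the data needs to be comfortably larger than that for the fit to be meaningful.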
adibender
  • Two points: 1. Is the degree set to one automatically? 2. We may need different degrees for different variables, right? Like poly(var1,3)+poly(var2,1) – tired and bored dev Jan 04 '17 at 18:38
  • 1.) yes, but you can include degree into the pasted formula, see edit – adibender Jan 04 '17 at 18:40
  • 3
    2.) yes, but as indicated in the comments above, you usually don't want to set the degree of polynomial in advance or better yet use global polynomials. Penalized splines are usually preferred, for example via the `gam` function from the `mgcv` package. Also, you probably want to do variable selection, but as mentioned before, that's a question for Cross Validated – adibender Jan 04 '17 at 18:43
  • Okay, thanks. I learned a lot here. I will check out gam, which might have some answers to the machine-learning part of my question. – tired and bored dev Jan 04 '17 at 18:47
  • @user1478061 In 2.) I meant to write "better yet **not** use global polynomials at all". – adibender Jan 04 '17 at 18:51
  • Thanks for your thoughts. Actually, I've not yet decided on the approach. I too think global polynomials might be a bad idea, but I just want to explore how it goes, to learn more about it. – tired and bored dev Jan 04 '17 at 18:57
  • 1
    My residuals vs. fitted plot looks like the third plot (the one with an exponential relationship) at http://stats.stackexchange.com/questions/253035/trying-to-understand-the-fitted-vs-residual-plot/253039#253039, which reminds me of the "unreasonable effectiveness of mathematics" argument. When I did a log transformation, R-squared jumped; not to a good value, but there was some improvement. – tired and bored dev Jan 04 '17 at 19:29
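The two points raised in the comments (a different degree per variable, and penalized splines instead of global polynomials) can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the degrees vector is a hypothetical choice, and the spline part runs only if the mgcv package is installed.

```r
# Simulated data with systematically named columns (illustrative only).
set.seed(2)
n <- 300
df <- data.frame(var1 = runif(n), var2 = runif(n), var3 = runif(n))
df$y <- sin(2 * pi * df$var1) + df$var2 + rnorm(n, sd = 0.2)

# Hypothetical per-variable degrees, keyed by column name.
degrees <- c(var1 = 3L, var2 = 1L, var3 = 2L)

# Builds poly(var1, degree = 3), poly(var2, degree = 1), poly(var3, degree = 2).
poly_terms <- sprintf("poly(%s, degree = %d)", names(degrees), degrees)
form_poly <- reformulate(poly_terms, response = "y")
fit_poly <- lm(form_poly, data = df)

# Penalized-spline alternative: one smooth term s() per variable,
# with the amount of smoothing estimated from the data.
if (requireNamespace("mgcv", quietly = TRUE)) {
  library(mgcv)
  smooth_terms <- paste0("s(", names(degrees), ")")
  form_gam <- reformulate(smooth_terms, response = "y")
  fit_gam <- gam(form_gam, data = df, method = "REML")
  print(summary(fit_gam))
}
```

Keeping the degrees in a named vector means the formula is regenerated from a single place when a degree changes, which scales better to 70 variables than editing the formula string by hand.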