
I am working on a project in R to select the best-fitting model.

I have 15 variables and a sample size of 790,000. A linear model does not work because the residuals are non-random and non-normal.

So I tried fitting a nonlinear model with higher-order polynomial and interaction terms. However, R is extremely slow and crashes from time to time because of the large dataset.

I tried stepwise selection (the step function) and the polym function, but neither was ideal. Is there a function or package for high-order polynomials and interactions? If I were to write a loop, how would I check the normality and randomness of the residuals for each scenario without looking at plots? (The Shapiro-Wilk test doesn't work because of the large sample size.) Thank you so much!
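One way to screen residuals inside a loop without plots is to reduce them to a few numbers. The sketch below is only an illustration on simulated data: `residual_summary` is a hypothetical helper (not from any package), and the choice of skewness, excess kurtosis, and the correlation between absolute residuals and fitted values is an assumption about what "normal and random" should mean here. Note that base R's `shapiro.test` only accepts between 3 and 5000 observations, which is why it fails on data this large.

```r
# Hypothetical helper (not from any package): numeric residual diagnostics.
residual_summary <- function(fit) {
  r <- residuals(fit)
  z <- (r - mean(r)) / sd(r)
  c(
    skewness        = mean(z^3),      # close to 0 for normal residuals
    excess_kurtosis = mean(z^4) - 3,  # close to 0 for normal residuals
    # Correlation of |residual| with the fitted values; a value far from 0
    # suggests the residual spread changes with the fitted value.
    spread_vs_fit   = cor(abs(r), fitted(fit))
  )
}

# Simulated data (an assumption, not the poster's data).
set.seed(42)
n <- 100000
x <- runif(n)

# Well-behaved case: normal errors with constant variance.
good <- lm(y ~ x, data = data.frame(x = x, y = 1 + 2 * x + rnorm(n)))

# Badly behaved case: skewed errors whose spread grows with x.
bad <- lm(y ~ x, data = data.frame(x = x, y = 1 + 2 * x + rexp(n) * (1 + 3 * x)))

res_good <- residual_summary(good)
res_bad  <- residual_summary(bad)
print(round(rbind(good = res_good, bad = res_bad), 3))
```

In a model-selection loop, these numbers could be stored per candidate model and compared, rather than testing for exact normality, which almost any model will fail at this sample size.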

Update:

fit2b <- lm(Assets ~ polym(C, Suc, SP, SS, Qual_P, A, TotalAA, Eq, D, PE, EI, GE, EO, degree = 5, raw = TRUE) + Gender + LT, data = f)

fit1b <- lm(Assets ~ A, data = f)

step(fit1b, scope = list(upper = fit2b, lower = ~1), direction = "forward", trace = FALSE)
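A likely reason for the slowness and crashes is the size of the design matrix that the polym call above has to build. As a rough sketch, assuming 13 variables at degree 5 and 790,000 rows of double-precision values, the number of polynomial columns and the memory they need can be estimated as follows. The small three-variable check at the end uses simulated data only.

```r
# Number of monomials of total degree 1..d in p variables: choose(p + d, d) - 1.
n_terms <- function(p, d) choose(p + d, d) - 1

p <- 13        # variables inside polym() in the call above
d <- 5         # degree used in the call above
n <- 790000    # sample size stated in the question

terms_full <- n_terms(p, d)
gb_needed  <- n * terms_full * 8 / 1e9   # 8 bytes per double
cat("columns:", terms_full, " approx. GB for the design matrix:", round(gb_needed, 1), "\n")

# Sanity check of the formula on a small simulated example.
set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50); x3 <- rnorm(50)
small <- polym(x1, x2, x3, degree = 2, raw = TRUE)
cat("small example columns:", ncol(small), "expected:", n_terms(3, 2), "\n")
```

Under these assumptions the full model has several thousand columns, so lowering the degree or restricting which variables get polynomial terms reduces the memory footprint far more than any change of fitting function.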

Also, I am wondering whether there are any other tools for detecting multicollinearity besides vif, and how I should adjust the model to address it.
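One alternative to vif is the condition number of the design matrix, which base R exposes through `kappa()`. The sketch below uses simulated data (an assumption) to compare raw polynomial columns with the orthogonal polynomials that `poly()` produces by default. It is meant only to show that `raw = TRUE` powers are strongly collinear, not to prescribe a fix for the model in the question.

```r
set.seed(7)
x <- runif(1000, min = 1, max = 10)   # simulated predictor

# Raw powers x, x^2, ..., x^5, standardized so scale differences do not dominate.
X_raw  <- scale(unclass(poly(x, degree = 5, raw = TRUE)))

# Orthogonal polynomial basis (the default, raw = FALSE).
X_orth <- unclass(poly(x, degree = 5))

# Exact 2-norm condition numbers; large values indicate near-collinear columns.
k_raw  <- kappa(X_raw,  exact = TRUE)
k_orth <- kappa(X_orth, exact = TRUE)
cat("condition number, raw powers:       ", signif(k_raw, 3), "\n")
cat("condition number, orthogonal basis: ", signif(k_orth, 3), "\n")

# Pairwise correlations are another quick check; raw powers are nearly collinear.
print(round(cor(X_raw)[4, 5], 4))
```

The fitted values are the same for both bases in a single-variable fit, so switching to the orthogonal basis changes the numerical stability and the interpretation of individual coefficients, not the model itself.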

  • It shouldn't be slow, because lm estimates the parameters in closed form. Without seeing your code, though, I can't say why it is. Further, with that many observations, your model will return confidence intervals that are unrealistically tight (it thinks it has over 700K degrees of freedom). I recommend running several models on randomly selected subsets (say 10% each) and comparing the results. That should fix both problems. – Bryan Goggin Jun 23 '16 at 15:26
  • I just updated the question with my code. I think my boss is only looking for point estimates, so using subsets didn't occur to me. Moreover, the model is not linear (there is another model with a degree-7 polynomial and degree-4 interactions). I have already split and filtered the data, so I have 400k observations as the model sample and 250k for validation. Do you think I should still use subsets? Also, I don't really know how to select the best model with interaction and polynomial terms, as we've only done linear regression projects before. – Shiring Jun 23 '16 at 18:45
  • Even though you have polynomials, the model is still linear in its parameters: the response is modeled as a linear combination of (transformed) explanatory variables. All of the usual linear model theory still applies. See https://en.wikipedia.org/wiki/Linear_regression for more info. – Bryan Goggin Jun 23 '16 at 23:19
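The subset suggestion from the comments can be sketched as follows. This is a minimal illustration on simulated data: the data frame `f_sim`, the helper `fit_subset`, the two-variable degree-2 polynomial, and the choice of five subsets are all assumptions made to keep the example small. The point is that a polynomial model is still fitted by ordinary lm, and that coefficients from random 10% subsets can be compared directly.

```r
set.seed(1)
n <- 20000

# Simulated stand-in for the real data (column names borrowed from the question).
f_sim <- data.frame(A = runif(n), C = runif(n))
f_sim$Assets <- 1 + 2 * f_sim$A - 3 * f_sim$A^2 +
  0.5 * f_sim$A * f_sim$C + rnorm(n, sd = 0.1)

# Hypothetical helper: fit the polynomial model on one random subset.
fit_subset <- function(data, frac = 0.10) {
  idx <- sample(nrow(data), size = floor(frac * nrow(data)))
  coef(lm(Assets ~ polym(A, C, degree = 2, raw = TRUE), data = data[idx, ]))
}

# One row per subset, one column per coefficient.
coefs <- t(replicate(5, fit_subset(f_sim)))

# Spread of each coefficient across subsets; small values mean stable estimates.
coef_sd <- apply(coefs, 2, sd)
print(round(coef_sd, 3))
```

If the coefficients vary a lot between subsets, that instability is itself useful information about the model, and each subset fit needs only a fraction of the memory of the full fit.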
