I have a dataset with 3 collinear predictors. I extract these predictors and run a principal component analysis (PCA) on them to reduce multicollinearity. What I want is to use the result in place of these predictors for further modelling.
- Is it incorrect to use the `predict` function to get the values for the 3 collinear predictors and use the predicted values for further analysis?
- Or, since the first two axes capture the majority of the variance (70% in the demo dataset and 96% in the actual dataset), should I use only the values from the first two axes instead of the 3 predicted values for further analysis?
#Creating sample dataset
df <- data.frame(ani_id = as.factor(1:10), var1 = rnorm(500), var2 = rnorm(500), var3 = rnorm(500))
### Principal Component Analysis
# prcomp() has no 'data' argument in its default method, so only the numeric columns are passed
myPCA1 <- prcomp(df[, -1], scale. = TRUE, center = TRUE)
summary(myPCA1)
This was my result from the demo dataset when I ran
> summary(myPCA1)
Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.0355 1.0030 0.9601
Proportion of Variance 0.3574 0.3353 0.3073
Cumulative Proportion  0.3574 0.6927 1.0000
This shows that the first two axes capture almost 70% of the variance.
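To make the "first two axes" option concrete, here is a minimal sketch of how the retained axes could be chosen from the cumulative proportion of variance. The seed, the `threshold` value, and the names `cum_var`, `n_keep`, and `scores_kept` are assumptions for illustration only and are not part of my actual analysis.

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible

df <- data.frame(ani_id = as.factor(1:10),
                 var1 = rnorm(500), var2 = rnorm(500), var3 = rnorm(500))
myPCA1 <- prcomp(df[, -1], scale. = TRUE, center = TRUE)

# Cumulative proportion of variance, read from the summary table
cum_var <- summary(myPCA1)$importance["Cumulative Proportion", ]

# Hypothetical cutoff: keep the smallest number of axes that reaches it
threshold <- 0.65
n_keep <- which(cum_var >= threshold)[1]

# Scores on the retained axes, one row per observation
scores_kept <- myPCA1$x[, seq_len(n_keep), drop = FALSE]
head(scores_kept)
```

With a cutoff like this, the demo data would keep two axes. Note that the demo predictors are independent random draws, so the variance splits roughly evenly across the three axes, unlike in my actual dataset.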
Now is it correct to do the following?
## Using predict function to predict the values of the 3 collinear predictors
axes1 <- predict(myPCA1, newdata = df)
head(axes1)
subset1 <- cbind(df, axes1)
names(subset1)
### Removing the actual 3 collinear predictors and getting a dataset with the ID and 3 predictors that are no longer collinear
subset1<- subset1[,-c(2:4)]
summary(subset1)
## Merge this to the actual dataset to use for further analysis in linear mixed effect models
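For comparison, here is a hedged sketch of the alternative I am considering: checking what `predict` returns on the original data and then keeping only the ID and the first two axes. The seed and the names `same_scores`, `max_offdiag`, and `pca_df` are assumptions introduced for this example.

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible

df <- data.frame(ani_id = as.factor(1:10),
                 var1 = rnorm(500), var2 = rnorm(500), var3 = rnorm(500))
myPCA1 <- prcomp(df[, -1], scale. = TRUE, center = TRUE)

# predict() on the data used to fit the PCA returns the component scores,
# which match the scores already stored in myPCA1$x
axes1 <- predict(myPCA1, newdata = df)
same_scores <- isTRUE(all.equal(unname(axes1), unname(myPCA1$x)))

# The scores on different axes are uncorrelated with each other
max_offdiag <- max(abs(cor(axes1)[upper.tri(diag(3))]))

# Dataset with the ID and only the first two axes
pca_df <- data.frame(ani_id = df$ani_id, axes1[, 1:2])
names(pca_df)
```

If this is right, `predict` applied to the original data gives the scores on the principal axes rather than new values of `var1` to `var3`, so the two options differ only in whether the third axis is kept.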
Thanks for helping! :)
PS: I did read https://stats.stackexchange.com/questions/72839/how-to-use-r-prcomp-results-for-prediction/72847#72847 but was still unsure, which is why I am asking here.