In the linear model y = a_0 + (a_1 × x_1 ) + (a_2 × x_2 ) + (a_3 × x_i ) + ϵ , what value for i∈[3,4,…,100] results in the model with the highest R-Squared?
The data are given in a CSV file with one dependent and 100 independent variables.
This question does not make a lot of sense.
Let's take a look at a definition of the coefficient of determination (i.e. "R squared"):
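R^2 = 1 - sum(e_i^2) / ((n - 1) * s^2)
where sum(e_i^2) is the sum of squared residuals over the n observations (here i indexes observations, not the predictor x_i from the question), and s^2 is the sample variance of y.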
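Taken literally, the search in the question is mechanical once R^2 is computed from the definition above. The following is a minimal sketch, not a definitive implementation: since the CSV file is not shown, it uses synthetic data as a stand-in, and the helper name `r_squared`, the data shape (200 rows), and the choice of x_42 as the "true" third predictor are all assumptions made for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on the columns of X plus an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # prepend the intercept column (a_0)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)                   # sum of squared residuals
    ss_tot = (n - 1) * np.var(y, ddof=1)          # (n - 1) * s^2
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for the CSV (assumption): 200 rows, 100 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
# Assumption for illustration: y depends on x_1, x_2 and x_42.
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0 * X[:, 41] + rng.normal(size=200)

# Keep x_1 and x_2 fixed, try each x_i for i = 3..100 (1-based), keep the best.
scores = {i: r_squared(X[:, [0, 1, i - 1]], y) for i in range(3, 101)}
best_i = max(scores, key=scores.get)
print(best_i, scores[best_i])
```

With real data, `X` and `y` would be read from the CSV instead (for example with `numpy.loadtxt` or pandas), with the column layout adjusted to match the file.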
Adding more and more predictors can only reduce (or leave unchanged) the in-sample sum of squared residuals, so R^2 never decreases, but the resulting model may predict poorly on new data because of overfitting.
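To make the overfitting point concrete, here is a small sketch on synthetic data (an assumption, since the original data are not available). Only the first predictor actually influences y; the other 49 are pure noise. The helper names `fit_ols` and `r_squared`, and the train/test sizes, are illustrative choices rather than anything from the question.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y on X with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r_squared(X, y, coef):
    """R^2 of the given coefficients evaluated on (X, y); can be negative on held-out data."""
    A = np.column_stack([np.ones(len(y)), X])
    resid = y - A @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n_train, n_test, p = 60, 1000, 50
X = rng.normal(size=(n_train + n_test, p))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n_train + n_test)  # only x_1 matters
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

train_r2, test_r2 = [], []
for k in range(1, p + 1):                 # use the first k predictors
    coef = fit_ols(X_tr[:, :k], y_tr)     # fit on the training rows only
    train_r2.append(r_squared(X_tr[:, :k], y_tr, coef))
    test_r2.append(r_squared(X_te[:, :k], y_te, coef))

print(train_r2[0], train_r2[-1])  # in-sample R^2 with 1 vs. 50 predictors
print(test_r2[0], test_r2[-1])    # held-out R^2 with 1 vs. 50 predictors
```

The in-sample R^2 can only go up as predictors are added, because each smaller model is nested in the next. The held-out R^2 is the number that reflects predictive performance, and it gets worse as noise predictors accumulate.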
So the critical question here would be: which features (variables) matter for a model with strong predictive performance?
That question goes well beyond the scope of SO (or any other forum); I recommend consulting any textbook on statistical modelling.