
I am creating a polynomial regression by transforming my variables into polynomial features, using degree 2. After the transformation I end up with more than 100 variables, but I was expecting 20 plus a constant (the variables and their second-degree powers). Here is the code:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

degree = 2
poly = PolynomialFeatures(degree, include_bias=False)
X_poly = poly.fit_transform(X)

# split features and target together so y_train lines up with X_train_poly
X_train_poly, X_test_poly, y_train, y_test = train_test_split(
    X_poly, y, test_size=0.2, random_state=42)

poly_model = sm.OLS(y_train, X_train_poly).fit()
print(poly_model.summary())
Amirgiano

3 Answers


By default, PolynomialFeatures produces all possible combinations of the input features (powers and cross-products) up to the given degree. To reduce the number of variables, you can:

  1. Use the interaction_only argument. When True, it produces only interaction features (products of at most degree distinct input features) alongside the original features, skipping the pure powers. This generally produces fewer features than the default setting, but the feature set will still be quite large (see the sketch after this list).

  2. Use feature selection techniques. After generating the polynomial features, you can use techniques like Recursive Feature Elimination, SelectKBest, or Lasso regularization to select the most important features and discard the rest.

  3. Use Principal Component Analysis. This technique can reduce the dimensionality of your data by creating new uncorrelated variables that capture the most variance in the data. You can calculate how many components you need to preserve a given amount of variance by using a cumulative sum of the explained variance ratio.
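
A minimal sketch of the three options on synthetic placeholder data (X, y, and all parameter values below are illustrative assumptions, not taken from the question):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] ** 2 + rng.normal(size=200)

# 1. interaction_only=True drops the pure powers (a^2, b^2, ...)
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(inter.fit_transform(X).shape)   # (200, 55): 10 originals + 45 products

# 2. keep only the k most predictive polynomial features
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
X_best = SelectKBest(f_regression, k=20).fit_transform(X_poly, y)

# 3. PCA: keep enough components to explain 95% of the variance
X_pca = PCA(n_components=0.95).fit_transform(X_poly)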

For me, it's always RFECV with an XGBoost classifier / regressor, depending on the problem. This method recursively eliminates features based on a tree-based model's feature importances and finds the subset of features that yields the best metric score.
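
A sketch of that approach, assuming the third-party xgboost package is installed and reusing the X_poly and y arrays from the question's code:

from xgboost import XGBRegressor
from sklearn.feature_selection import RFECV

# recursively drop the least important feature, scoring each subset with CV
selector = RFECV(XGBRegressor(n_estimators=100), step=1, cv=5,
                 scoring='neg_mean_squared_error')
X_selected = selector.fit_transform(X_poly, y)
print(selector.n_features_)  # size of the best-scoring feature subset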

Chris
  • As I answered to Harshad as well: I have tried interaction_only (both False and True) and include_bias=False, but I still get interactions. I want to have only the variables with their degrees, because the literature says nothing about interactions between the variables. – Amirgiano Jul 03 '23 at 11:25

If you use polynomial features with degree 2, then from just 2 features a and b it creates a, b, a^2, b^2 and the interaction a*b (there is no factor of 2; that comes from expanding (a+b)^2, not from PolynomialFeatures). With 3 features a, b, c it creates a, b, c, a^2, b^2, c^2, ab, ac, bc.

So, with 10 input features and degree 2, there are 10! / [2!(10-2)!] = 45 pairwise interaction terms. Adding the 10 squares gives 55 new columns, and together with the 10 original features the output has 65 columns with include_bias=False (66 if the bias column were included).

So, please check the shape of your data with X.shape before feeding it to PolynomialFeatures. With 10 input features you should see 65 output columns; if you are getting more than 100 columns, you must be starting from 13 or more input features.
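
A quick sanity check along these lines (a sketch; X, X_poly, and poly are the objects from the question, and math.comb needs Python 3.8+):

from math import comb

n = X.shape[1]                 # number of original input features
expected = 2 * n + comb(n, 2)  # originals + squares + pairwise products
print(n, expected, X_poly.shape[1])  # expected should equal the actual count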

To get only the pure-degree columns, as you mentioned in the comments, you can build a new dataframe containing just the columns you want and use it instead of the full one. So in summary, the new dataframe holds the selected columns while the old one keeps everything. Or you can select the columns you are interested in directly, like this: df = df[['a2', 'b2']]
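
One way to do that selection programmatically (a sketch assuming scikit-learn >= 1.0, where get_feature_names_out is available): interaction columns are named with a space ('x0 x1'), while pure terms are not ('x0', 'x0^2'), so you can filter on that.

import pandas as pd

names = poly.get_feature_names_out()
df_poly = pd.DataFrame(X_poly, columns=names)

# interaction columns look like 'x0 x1'; pure powers like 'x0^2' have no space
pure_cols = [name for name in names if ' ' not in name]
df_pure = df_poly[pure_cols]   # original features and their powers only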

Harshad Patil
  • I have tried interaction_only (both False and True) and include_bias=False, but I still get interactions. I want to have only the variables with their degrees, because the literature says nothing about interactions between the variables. – Amirgiano Jul 03 '23 at 11:24

scikit-learn's PolynomialFeatures doesn't give you just the pure-degree values. That's why my solution is:

# X_poly here must be a pandas DataFrame of the original features
degree = 3
for column in X_poly.columns:
    X_poly[column + '_3'] = X_poly[column] ** degree
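
Along the same lines, a sketch of the manual route for the original degree-2 case (assuming X is a pandas DataFrame of the raw features and y the target), which yields exactly the 2n variables plus a constant the question expected:

import statsmodels.api as sm

X_sq = X.copy()
for column in X.columns:
    X_sq[column + '_2'] = X[column] ** 2      # add each squared term

model = sm.OLS(y, sm.add_constant(X_sq)).fit()  # 2n variables + constant
print(model.summary())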
Amirgiano