In order to fit a linear regression model to given training data X and labels y, I want to augment my training data X with nonlinear transformations of the given features. Let's say we have the features x1, x2 and x3, and we want to use the additional transformed features:
x4 = x1^2, x5 = x2^2 and x6 = x3^2
x7 = exp(x1), x8 = exp(x2) and x9 = exp(x3)
x10 = cos(x1), x11 = cos(x2) and x12 = cos(x3)
I tried the following approach, which however led to a model that performed very poorly with Root Mean Squared Error (RMSE) as the evaluation criterion:
import pandas as pd
import numpy as np
from sklearn import linear_model
# import the training data and extract the features and labels from it
DATAPATH = 'train.csv'
data = pd.read_csv(DATAPATH)
features = data.drop(['Id', 'y'], axis=1)
labels = data[['y']]
# quadratic features
features['x4'] = features['x1']**2
features['x5'] = features['x2']**2
features['x6'] = features['x3']**2
# exponential features
features['x7'] = np.exp(features['x1'])
features['x8'] = np.exp(features['x2'])
features['x9'] = np.exp(features['x3'])
# cosine features
features['x10'] = np.cos(features['x1'])
features['x11'] = np.cos(features['x2'])
features['x12'] = np.cos(features['x3'])
# fit ordinary least squares on the augmented feature matrix
regr = linear_model.LinearRegression()
regr.fit(features, labels)
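The RMSE itself I compute roughly like this (just a sketch of my setup: the hold-out split and random_state below are only for illustration, the real score comes from a separate evaluation set):
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# hold-out split purely for illustration; my actual evaluation setup may differ
X_train, X_val, y_train, y_val = train_test_split(features, labels, test_size=0.2, random_state=0)
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
# RMSE = square root of the mean squared error on the held-out data
rmse = np.sqrt(mean_squared_error(y_val, regr.predict(X_val)))
print('validation RMSE:', rmse)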
I'm quite new to ML, and I'm sure there is a better way to do these nonlinear feature transformations. I'd be very happy for your help.
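For example, would something along these lines be cleaner? It builds all the transformed columns in one go and wraps the transformation in a scikit-learn Pipeline (just a sketch of what I have in mind, assuming the raw columns are named x1, x2 and x3 as above):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def augment(X):
    # stack [x, x^2, exp(x), cos(x)] column-wise
    X = np.asarray(X)
    return np.hstack([X, X**2, np.exp(X), np.cos(X)])

model = make_pipeline(FunctionTransformer(augment), linear_model.LinearRegression())
model.fit(features[['x1', 'x2', 'x3']], labels)
Or is there an even more standard way to do this, e.g. with PolynomialFeatures for the quadratic part?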
Cheers Lukas