(The dataset is linked at the bottom.) I'm trying to use linear regression on a dataset where the predictors are the product ID, weight, type, Outlet_Establishment_Year, etc., and the target variable is Item_Outlet_Sales. I use R-squared as the metric. The predictors have different units, so I think I need to scale them. First I prepare the data:

import pandas as pd

X = cleaned_data.iloc[:, :-1]  # predictors
X = pd.get_dummies(data=X, drop_first=True)  # convert categorical variables to dummy (0/1) columns
Y = cleaned_data.iloc[:, -1]  # target

Then I scale the data, fit a linear regression, and calculate R-squared, which yields 0.57:

from sklearn.preprocessing import StandardScaler

concat_data = pd.concat([X, Y], axis = 1)
scaled_data = StandardScaler().fit_transform(concat_data)

X_scaled = scaled_data[:, :-1]
Y_scaled = scaled_data[:, -1]

print(X_scaled.shape, Y_scaled.shape)

from sklearn.linear_model import LinearRegression

LR_scaled_model = LinearRegression()
LR_scaled_model.fit(X_scaled, Y_scaled)

from sklearn.metrics import r2_score
predicted_sales = LR_scaled_model.predict(X_scaled)
print('R-squared:', r2_score(Y_scaled, predicted_sales))

If I instead fit the linear regression without scaling, the R-squared is 0.67:

LR_non_scaling_model = LinearRegression()
LR_non_scaling_model.fit(X, Y)

predicted_sales = LR_non_scaling_model.predict(X)
print('R-squared:', r2_score(Y, predicted_sales))

How would you explain this? And in linear regression tasks, when should I scale my data and when should I not?

Dataset: https://drive.google.com/file/d/1AeK2aCnKtr0xMHz1B_Vfq4HnIkd2pxW_/view?usp=share_link

1 Answer

It seems like the scaling is also applied to the one-hot-encoded dummy variables, which IMO should not happen. If you scale only the continuous variables, does that change the behavior?
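One way to scale only the continuous columns is scikit-learn's `ColumnTransformer`. The following is a minimal sketch on synthetic data, not the asker's dataset: the data frame `df`, the coefficients used to generate `y`, and the choice of `Item_Weight`, `Item_MRP` and `Outlet_Type` as columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data; column names and values are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Item_Weight": rng.normal(12.0, 4.0, n),
    "Item_MRP": rng.normal(140.0, 60.0, n),
    "Outlet_Type": rng.choice(["Grocery", "Supermarket1", "Supermarket2"], n),
})
y = (
    15.0 * df["Item_MRP"]
    + 300.0 * (df["Outlet_Type"] != "Grocery")
    + rng.normal(0.0, 200.0, n)
)

numeric_cols = ["Item_Weight", "Item_MRP"]
categorical_cols = ["Outlet_Type"]

# Standardize the continuous columns; one-hot encode the categorical one
# and leave the resulting 0/1 dummies unscaled.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])

model = make_pipeline(preprocess, LinearRegression())
model.fit(df, y)

# For a regressor, score() returns R-squared on the given data.
r2_partial_scaling = model.score(df, y)
print("R-squared:", r2_partial_scaling)
```

Wrapping the preprocessing and the regression in one pipeline also keeps the scaler fitted on the same rows the model sees, which matters once a train/test split is introduced.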

Generally, for ordinary least squares, scaling affects only the interpretation of the coefficients and not the quality of the model. After standard scaling, a coefficient $\beta_1$ can be interpreted as:

A one standard deviation change in the independent variable is associated with a $\beta_1$ change in the dependent variable (measured in standard deviations of the dependent variable, if it was standardized too).
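A quick way to check this claim is to fit the same ordinary least squares model on raw and on standardized data and compare the R-squared values. The sketch below uses synthetic data; the feature scales, the true coefficients and the noise level are arbitrary assumptions chosen only to give the predictors very different units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales (values are arbitrary).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3)) * np.array([1.0, 50.0, 1000.0])
y = X @ np.array([2.0, 0.1, 0.003]) + rng.normal(scale=2.0, size=500)

# Fit on the raw data.
raw_model = LinearRegression().fit(X, y)
r2_raw = r2_score(y, raw_model.predict(X))

# Fit on standardized predictors and a standardized target.
X_scaled = StandardScaler().fit_transform(X)
y_scaled = StandardScaler().fit_transform(y.reshape(-1, 1)).ravel()
scaled_model = LinearRegression().fit(X_scaled, y_scaled)
r2_scaled = r2_score(y_scaled, scaled_model.predict(X_scaled))

print("R-squared (raw):   ", r2_raw)
print("R-squared (scaled):", r2_scaled)

# The coefficients differ only by the ratio of standard deviations:
# beta_scaled = beta_raw * std(x) / std(y)
expected_scaled_coef = raw_model.coef_ * X.std(axis=0) / y.std()
print("Coefficients match rescaling:",
      np.allclose(scaled_model.coef_, expected_scaled_coef))
```

If the two R-squared values differ noticeably on real data, it may be worth checking that both fits really use the same rows and columns before attributing the gap to scaling.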

Moritz Wilksch