(I leave my dataset at the bottom line). I'm trying to use Linear Regression on a dataset where predictors are the product ID, weight, type, Outlet_Establishment_Year, etc and target variable is the Item_Outlet_Sales. I use R-squared as the metric. I think the predictors have different units so I'll need to scale them. If I do so:
X = cleaned_data.iloc[:, :-1] # predictors
X = pd.get_dummies(data = X, drop_first = True) # convert categorical variables to numerical variables
Y = cleaned_data.iloc[:, -1] # target
Then I scale the data, perform Linear Regression and calculate R-squared which yield 0.57 as a result:
from sklearn.preprocessing import StandardScaler
concat_data = pd.concat([X, Y], axis = 1)
scaled_data = StandardScaler().fit_transform(concat_data)
X_scaled = scaled_data[:, :-1]
Y_scaled = scaled_data[:, -1]
print(X_scaled.shape, Y_scaled.shape)
from sklearn.linear_model import LinearRegression
LR_scaled_model = LinearRegression()
LR_scaled_model.fit(X_scaled, Y_scaled)
from sklearn.metrics import *
predicted_sales = LR_scaled_model.predict(X_scaled)
print('R-squared:', r2_score(Y_scaled, predicted_sales))
And if I just implement Linear Regression without scaling, the R-squared is 0.67
LR_non_scaling_model = LinearRegression()
LR_non_scaling_model.fit(X, Y)
predicted_sales = LR_non_scaling_model.predict(X)
print('R-squared:', r2_score(Y, predicted_sales))
How would you explain this? And, in linear regression tasks, when should I and when should not I scale my data?
Dataset: https://drive.google.com/file/d/1AeK2aCnKtr0xMHz1B_Vfq4HnIkd2pxW_/view?usp=share_link