I have an imbalanced dataset and need to balance it to predict the target with tree-based regressors like DecisionTreeRegressor.
To balance it, I found solutions such as:
- Square Root Transformation
- Log Transformation
- Box-Cox Transformation
However, I can't use the log or Box-Cox transformations because my y array contains many zeros, which leads to errors like:
RuntimeWarning: divide by zero encountered in the log
So far, I can only use the square root transform method:
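import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error
# collects [r2, mse] per experiment
results_dict = {}
transformation = 'square_root'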
y_train_transformed = np.sqrt(y_train)
reg = DecisionTreeRegressor().fit(X_train, y_train_transformed)
# create predictions on the test set
preds = reg.predict(X_test)
# transform back
preds = preds ** 2
# get mse and r2
r2 = r2_score(y_test, preds)
mse = mean_squared_error(y_test, preds)
# store in results dict
results_dict[transformation] = [r2, mse]
However, when I compare this result with the one from the regression without any transformation, I see that this method does not work well for me:
transformation = 'no transformation'
reg = DecisionTreeRegressor().fit(X_train, y_train)
# create predictions on test set
preds = reg.predict(X_test)
# get mse and r2
r2 = r2_score(y_test, preds)
mse = mean_squared_error(y_test, preds)
# store in results dict
results_dict[transformation] = [r2, mse]
The final results:
import pandas as pd
df_results = pd.DataFrame.from_dict(results_dict, orient="index", columns=["R2-Score", "MSE"])
df_results
Since balancing the dataset with this transformation did not give me better results, I kept looking for other solutions. I found that there is a SMOTE library for regression in Python (smogn), but I can't find a good example that shows how to use it.
So far, I have only found this example:
## load libraries
import smogn
import pandas
## load data
housing = pandas.read_csv(
## http://jse.amstat.org/v19n3/decock.pdf
"https://raw.githubusercontent.com/nickkunz/smogn/master/data/housing.csv")
## conduct smogn
housing_smogn = smogn.smoter(
data = housing,
y = "SalePrice"
)
https://github.com/nickkunz/smogn
But it is still not clear to me how to apply this method to my own data:
import smogn
smogn = smogn.smoter(
# I am not sure whether this is correct
data = X_train,
y = y_train
)
reg = DecisionTreeRegressor().fit(smogn[data], smogn[y]) # I do not know what I should put here
Can someone explain to me how to use this method?
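Going by the README example above, `smoter` appears to take a single pandas DataFrame that contains both the features and the target, with `y` given as the name of the target column rather than as an array. A minimal sketch under that assumption follows. The helper name `balance_with_smogn` and the column name "target" are hypothetical, and it assumes `X_train` is a DataFrame and `y_train` is one-dimensional:

```python
def balance_with_smogn(X_train, y_train, target="target"):
    """Resample the training data with smogn and split it back into X and y.

    Assumes X_train is a pandas DataFrame and y_train is a 1-D array or Series
    with one value per row of X_train. `target` is a hypothetical column name
    and must not clash with an existing feature column.
    """
    import smogn  # third-party package, assumed to be installed

    # smoter takes ONE DataFrame holding the features and the target together;
    # the index is reset so that rows are numbered 0..n-1
    train = X_train.reset_index(drop=True)
    train[target] = list(y_train)

    # y is the NAME of the target column (a string), as in the README example
    resampled = smogn.smoter(data=train, y=target)

    # split the resampled frame back into features and target
    X_res = resampled.drop(columns=[target])
    y_res = resampled[target]
    return X_res, y_res
```

It would then be called as `X_res, y_res = balance_with_smogn(X_train, y_train)`, followed by `DecisionTreeRegressor().fit(X_res, y_res)`. Only the training data is resampled; `X_test` and `y_test` stay untouched so that the scores still reflect the original distribution.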
Also, given that I have a lot of zero values in my dataset, are there any other methods I can use to balance the dataset?
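For reference, the divide-by-zero warning above comes from evaluating log(0). One zero-safe variant of the log transformation is log(1 + y), which maps 0 to 0 and is inverted with exp(z) - 1. The sketch below shows the round trip using only the standard library; the helper names `to_log1p` and `from_log1p` and the sample values are made up for illustration, and NumPy offers the same functions as `np.log1p` and `np.expm1`:

```python
import math

def to_log1p(values):
    """Forward transform: log(1 + y), defined for y >= 0, including y == 0."""
    return [math.log1p(v) for v in values]

def from_log1p(values):
    """Inverse transform: exp(z) - 1, mapping values back to the original scale."""
    return [math.expm1(v) for v in values]

# made-up target values with several zeros
y = [0.0, 0.0, 3.0, 120.0]

z = to_log1p(y)          # zeros stay at 0.0, no warning is raised
y_back = from_log1p(z)   # recovers the original values up to rounding
```

In the experiment above, this would mean fitting on `np.log1p(y_train)` and applying `np.expm1` to the predictions instead of squaring them. Whether that helps a tree-based regressor on this data is a separate question and would need to be checked the same way as the square root variant.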