
I have an imbalanced dataset and need to balance it before predicting the target with tree-based regressors such as DecisionTreeRegressor.
To balance it, I found suggestions such as:

  1. Square Root Transformation
  2. Log Transformation
  3. Box-Cox Transformation

However, I can't use methods 2 and 3 because my y array contains many zeros, and I get warnings like:

RuntimeWarning: divide by zero encountered in log
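For context, the warning comes from taking log(0). A minimal sketch of a zero-safe variant uses NumPy's `log1p` and `expm1`; the array `y_example` is a made-up stand-in for the real target, not my data. Box-Cox requires strictly positive values, while scikit-learn's `PowerTransformer(method="yeo-johnson")` accepts zeros.

```python
import numpy as np

# Hypothetical target with many zeros, standing in for y_train.
y_example = np.array([0.0, 0.0, 3.0, 10.0, 250.0])

# log1p computes log(1 + y), so zeros map to 0 instead of -inf.
y_log = np.log1p(y_example)

# expm1 computes exp(y) - 1 and maps values back to the original scale.
y_back = np.expm1(y_log)
```

The same pair of functions could replace `np.sqrt` and `** 2` in the square-root code; whether it helps a tree-based model on a given dataset is not guaranteed.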

So far, I can only use the square root transform:

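import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

results_dict = {}  # maps transformation name -> [r2, mse]

transformation = 'square_root'
# fit on the square-root-transformed target
y_train_transformed = np.sqrt(y_train)
reg = DecisionTreeRegressor().fit(X_train, y_train_transformed)
# create predictions on the test set
preds = reg.predict(X_test)
# transform back to the original scale
preds = preds ** 2
# get mse and r2
r2 = r2_score(y_test, preds)
mse = mean_squared_error(y_test, preds)
# store in results dict
results_dict[transformation] = [r2, mse]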

But when I compare this with the result of the regression on the untransformed target, I see that the square root transform does not work well for me:

transformation = 'no transformation'
reg = DecisionTreeRegressor().fit(X_train, y_train)
# create predictions on the test set
preds = reg.predict(X_test)
# get mse and r2
r2 = r2_score(y_test, preds)
mse = mean_squared_error(y_test, preds)
# store in results dict
results_dict[transformation] = [r2, mse]
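The two runs can also be folded into one helper so that both use the same split and the same seed, since a DecisionTreeRegressor without `random_state` can vary between runs. This is only a sketch: the zero-inflated data is synthetic, the helper name `evaluate` is made up, and it relies on scikit-learn's `TransformedTargetRegressor` to apply and invert the transform.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic zero-inflated target (an assumption, not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.where(rng.random(500) < 0.6, 0.0, np.exp(X[:, 0] + 1.0))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def evaluate(func=None, inverse_func=None):
    """Fit a tree on func(y) and score predictions on the original scale."""
    tree = DecisionTreeRegressor(random_state=0)
    # With func=None and inverse_func=None the target is left unchanged.
    model = TransformedTargetRegressor(
        regressor=tree, func=func, inverse_func=inverse_func
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)  # already mapped back by inverse_func
    return [r2_score(y_test, preds), mean_squared_error(y_test, preds)]


results_dict = {
    "no transformation": evaluate(),
    "square_root": evaluate(np.sqrt, np.square),
}
```

Fixing `random_state` makes the comparison repeatable, so any difference between the two rows comes from the transform rather than from tree randomness.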

The final results:

import pandas as pd
df_results = pd.DataFrame.from_dict(results_dict, orient="index", columns=["R2-Score", "MSE"])
df_results


Since the square root transform did not improve the results, I kept looking for other solutions and found that there is a SMOTE library for regression in Python (smogn), but I can't find a good example showing how to use it.
So far, I have only found this example:

## load libraries
import smogn
import pandas

## load data
## http://jse.amstat.org/v19n3/decock.pdf
housing = pandas.read_csv(
    "https://raw.githubusercontent.com/nickkunz/smogn/master/data/housing.csv"
)

## conduct smogn
housing_smogn = smogn.smoter(
    data = housing,
    y = "SalePrice"
)

https://github.com/nickkunz/smogn

But it is still not clear to me how to apply this method to my own data:

import smogn
smogn = smogn.smoter(
    # I am not sure if this is correct
    data = X_train, 
    y = y_train  
)

reg = DecisionTreeRegressor().fit(smogn[data], smogn[y]) # I do not know what I should put here

Can someone explain to me how to use this method?
Also, given that my dataset contains many zero values, are there other methods I can use to balance it?
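For reference, here is a hedged sketch of how the call might be wired up, going by the README example, where `data` is a single pandas DataFrame and `y` is the name of the target column in it. The helper names `to_training_frame` and `fit_on_smogn_sample` and the column name "target" are made up for illustration; resetting the index is a precaution, not something the README requires.

```python
import numpy as np
import pandas as pd


def to_training_frame(X_train, y_train, target="target"):
    """Join features and target into one DataFrame with a default index."""
    frame = pd.DataFrame(X_train).copy()
    frame.columns = [str(c) for c in frame.columns]
    # np.asarray avoids index alignment if y_train is a pandas Series.
    frame[target] = np.asarray(y_train)
    return frame.reset_index(drop=True)


def fit_on_smogn_sample(X_train, y_train, target="target"):
    """Resample the training data with smogn, then fit a tree on the result."""
    import smogn  # third-party package, imported only when this is called
    from sklearn.tree import DecisionTreeRegressor

    frame = to_training_frame(X_train, y_train, target)
    # y is the column name as a string, not the target array itself.
    resampled = smogn.smoter(data=frame, y=target)
    X_res = resampled.drop(columns=[target])
    y_res = resampled[target]
    return DecisionTreeRegressor().fit(X_res, y_res)
```

Only the training set would be resampled here; X_test and y_test stay untouched so the scores remain comparable with the other runs. If the test features are a DataFrame, their column names would need to match the string names used during fitting.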

Sara
