
I'm trying to apply a baseline model to my data set, but the data set is imbalanced: only 11% of the records belong to the positive class. When I split the data without any sampling, the recall for positive records is very low. I want to balance the training data (0.5 negative, 0.5 positive) without balancing the test data. Does anyone know how to do that?

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

#splitting train and test data
train, test = train_test_split(coupon, test_size=0.3, random_state=100)

##separating dependent and independent variables
cols = [i for i in coupon.columns if i not in target_col]
train_X = train[cols]
train_Y = train[target_col]
test_X = test[cols]
test_Y = test[target_col]

#Function attributes
#dataframe     - processed dataframe
#Algorithm     - Algorithm used 
#training_x    - predictor variables dataframe(training)
#testing_x     - predictor variables dataframe(testing)
#training_y    - target variable(training)
#testing_y     - target variable(testing)
#cf - ["coefficients","features"](coefficients for logistic
#regression, features for tree based models)

#threshold_plot - if True returns threshold plot for model
def coupon_use_prediction(algorithm,training_x,testing_x,
                         training_y,testing_y,cols,cf,threshold_plot) :

    #model
    algorithm.fit(training_x, training_y)
    predictions   = algorithm.predict(testing_x)
    probabilities = algorithm.predict_proba(testing_x)
    #coeffs
    if   cf == "coefficients" :
        coefficients  = pd.DataFrame(algorithm.coef_.ravel())
    elif cf == "features" :
        coefficients  = pd.DataFrame(algorithm.feature_importances_)

    column_df     = pd.DataFrame(cols)
    coef_sumry    = (pd.merge(coefficients, column_df, left_index=True,
                              right_index=True, how="left"))
    coef_sumry.columns = ["coefficients", "features"]
    coef_sumry    = coef_sumry.sort_values(by="coefficients", ascending=False)

    print(algorithm)
    print("\n Classification report : \n", classification_report(testing_y, predictions))
    print("Accuracy   Score : ", accuracy_score(testing_y, predictions))
Stella
    No. You're solving an XY problem. You have imbalanced data, and now you're trying to just balance out the data on training step, and keep it imbalanced at testing step in an attempt to "solve" the problem. Do not just blindly change the data, you want to read up on how to properly handle imbalanced data. (Some of it will actually be slightly adjusting the ratio of train to test data, but it's not to the extent of making it 50-50. And there's a lot of other things that you can and should try as well, including changing your metrics, adding weights if the algorithm supports it, and so on). – Paritosh Singh Dec 24 '19 at 08:13
  • reference: [XY Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Your X is: how to train a good model given imbalanced data. Your Y sounds like: how do I simply make the data equally proportionate at training time, while leaving the test data untouched. – Paritosh Singh Dec 24 '19 at 08:14
  • I see. You are right. I want to handle the imbalanced data. I will search more information about that. Thanks! – Stella Dec 24 '19 at 08:29

1 Answer


You have two ways of balancing data: upsampling or downsampling.

Upsampling: duplicating records from the under-represented class. Downsampling: sampling a subset of the over-represented class.

Upsampling is pretty easy. For downsampling you can use sklearn.utils.resample and provide the number of samples you want to get.
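A minimal sketch of this downsampling approach, applied to the training split only (the toy data frame, the column name `target`, and the 50/50 ratio are assumptions for illustration; your test split stays untouched):

```python
import pandas as pd
from sklearn.utils import resample

# toy imbalanced training frame: ~11% positives, like the question's data
train = pd.DataFrame({
    "feature": range(100),
    "target":  [1] * 11 + [0] * 89,
})

pos = train[train["target"] == 1]
neg = train[train["target"] == 0]

# downsample the majority class to the minority size -> 50/50 training set
neg_down = resample(neg, replace=False, n_samples=len(pos), random_state=100)
train_balanced = pd.concat([pos, neg_down])

print(train_balanced["target"].value_counts())  # 11 of each class
```

For upsampling instead, resample the minority class with `replace=True` and `n_samples=len(neg)`.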

Please note that, as @paritosh-singh mentioned, changing the distribution may not be the only solution. There are machine learning algorithms that can:

- support imbalanced data
- take the data distribution into account through a built-in weighting option
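For instance, scikit-learn's LogisticRegression has such a built-in weighting option via its `class_weight` parameter (a minimal sketch on synthetic data; the imbalance ratio here is an assumption mirroring the question's 11%):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic imbalanced data: roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=100)

# class_weight="balanced" reweights samples inversely to class frequency,
# so minority-class errors cost more in the loss -- no resampling needed
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)
print(model.score(X, y))
```

Tree-based models in scikit-learn (e.g. RandomForestClassifier) accept the same `class_weight` parameter.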

MrMey