
I have a credit card dataset in which 98% of transactions are non-fraud and 2% are fraud.

I have been trying to undersample the majority class before the train/test split, and I get very good recall and precision on the test set.

When I undersample only the training set and test on the independent set, I get very poor precision but the same recall!

My question is:

Should I undersample before splitting into train and test? Will this mess with the distribution of the dataset and not be representative of the real world?

Or does the above logic only apply when oversampling?

Thank you

1 Answer


If you have a chance to collect more data, that could be the best solution (assuming you have already attempted this step).

If precision is poor and recall is good, that indicates your model is good at predicting the fraud class as fraud, but it is confused about the non-fraud class: much of the time it predicts non-fraud as fraud (assuming 0 for the majority class and 1 for the minority class). This means you should try reducing the undersampling rate for the majority class.
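To see why recall can stay good while precision collapses on a realistic test set, here is a back-of-the-envelope sketch; the test-set size, recall, and false-positive rate are assumed numbers for illustration, not from the question:

```python
# hypothetical test set mirroring the 98/2 split: 980 non-fraud, 20 fraud
recall = 0.75   # assumed: the model catches 75% of frauds
fpr = 0.05      # assumed: 5% of non-fraud transactions get flagged as fraud

tp = round(20 * recall)   # 15 frauds caught
fp = round(980 * fpr)     # 49 legitimate transactions flagged as fraud
precision = tp / (tp + fp)
print(precision)          # ~0.23: the false positives swamp the true positives
```

With only 2% fraud, even a modest false-positive rate produces far more false alarms than there are frauds, so precision drops even though recall is unchanged.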

Typically, undersampling/oversampling is done on the train split only; this is the correct approach. However:

  1. Before undersampling, make sure your train split has the same class distribution as the full dataset (use stratified sampling while splitting).

  2. If you are using Python's sklearn library for training your classifier, set the parameter class_weight='balanced'.

For example:

   from sklearn.linear_model import LogisticRegression
   lr = LogisticRegression(class_weight='balanced')
  3. Try different algorithms with different hyperparameters; if the model is underfitting, consider XGBoost.
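Putting the points above together, here is a minimal sketch of the recommended pipeline: a stratified split first, then undersampling applied to the train split only, with the untouched imbalanced test split used for evaluation. The toy data, sizes, and seed are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# toy imbalanced data (hypothetical): ~98% class 0, ~2% class 1
rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.02).astype(int)
X = rng.normal(size=(n, 4)) + y[:, None]   # shift class 1 so it is learnable

# 1. stratified split keeps the ~98/2 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 2. undersample the majority class in the TRAIN split only
maj = np.flatnonzero(y_tr == 0)
mino = np.flatnonzero(y_tr == 1)
keep = rng.choice(maj, size=mino.size, replace=False)
idx = np.concatenate([keep, mino])

clf = LogisticRegression().fit(X_tr[idx], y_tr[idx])

# 3. evaluate on the untouched, still-imbalanced test split
print(clf.score(X_te, y_te))
```

Because the test split is never resampled, the metrics it reports reflect the real-world class ratio.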

If you undersample before splitting, then the test split's distribution may not replicate the distribution of real-world data. Hence, people typically avoid sampling before splitting.
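A quick way to see the distortion: precision depends on the class prior, so a test set rebalanced to 50/50 reports a far higher precision than the real 2% fraud rate would, which matches the symptom in the question. The recall and false-positive rate below are assumed values, holding the model fixed:

```python
# precision as a function of the fraud prior, for a fixed model
recall, fpr = 0.90, 0.05   # assumed model behaviour

def precision(prior):
    tp = prior * recall        # expected true-positive mass
    fp = (1 - prior) * fpr     # expected false-positive mass
    return tp / (tp + fp)

print(precision(0.50))  # balanced test set (undersampled before split): ~0.95
print(precision(0.02))  # real-world 2% fraud rate: ~0.27
```

The same model, with the same recall, looks excellent on the balanced test set and poor on the realistic one; only the second number tells you how it will behave in production.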

  • This is a really helpful answer. I just had a couple more questions: 1. "This means that you have to try on reducing the undersampling rate for the majority class." Do you mean I should not undersample the majority class a lot? For example, if I have 100 in the majority and 2 in the minority (train only), the current sampling makes both 2-2. – Vardaan Khanted Feb 09 '21 at 20:59
  • 2. The class_weight parameter cannot be used after undersampling, right? As we are already making the number of samples equal for the majority and minority classes. What steps can I take to improve the precision? The recall is close to 0.75 but the precision is 0.03, with XGBoost as well. – Vardaan Khanted Feb 09 '21 at 21:03
  • Try different sampling rates with different weight ratios (majority:minority) 50%:50%, 40%:60%, 30%:70%, etc., and see which gives better results. You can do this easily with the `class_weight` parameter without performing the manual sampling (as you are aware). However, typically for this kind of objective **FP** is not the costly error, but **FN** is. Hence we can put more interest on increasing **Recall**. – Santosh Pothabattula Feb 10 '21 at 06:15
  • So do you think I should focus more on getting a higher recall vs a higher precision? But what would that mean? We get all the frauds correctly (high recall), but we are also misclassifying a lot of legit transactions as fraud (low precision). I am just trying to understand the meaning of the trade-off. Is my understanding right? – Vardaan Khanted Feb 10 '21 at 14:43
  • 3
    See, in **similar** real-world cases, say if single **fraud** transaction was misclassified as **non-fraud** then it can impact business very badly (even for single **FN**). Let if a **non-fraud** transaction was predicted as **fraud** (**FP**), it will not (or very less) impact the business revenue, because later with further business checks we understand that was misclassified as **fraud**. But the model should not give a single chance to escape true**frauds** as it requires immediate action. Hence I mentioned for **this kind** of cases **FN** are very costlier than **FP**. – Santosh Pothabattula Feb 10 '21 at 17:30
  • 1
    Say you are dealing with a business problem like **sentiment classification** for any product reviews, in such cases both **Recall** and **Precision** are important. hope it helps – Santosh Pothabattula Feb 10 '21 at 17:39