I'm working on the Kaggle Credit Card Fraud Detection dataset.

There is a significant imbalance between Class = 1 (fraudulent transactions) and Class = 0 (non-fraudulent). To compensate, I undersampled the data so that there was a 1:1 ratio between fraudulent and non-fraudulent transactions (492 each). When I trained my Logistic Regression classifier on the undersampled/balanced data, it performed well. However, when I took that same classifier and tested it on the entire dataset, the recall was still good, but the precision dropped significantly.

I am aware that high recall is much more important for this type of problem, but I would still like to understand why the precision tanks, and whether this is acceptable.
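
To make the effect concrete, here is a back-of-the-envelope check (a rough sketch, not part of my pipeline): the full dataset has 284,807 rows with 492 frauds, and I inferred the false-positive rate from the accuracy and precision reported in the output below.

frauds = 492
legit = 284_807 - 492
recall = 0.91  # roughly what the undersampled model achieves on the full data
fpr = 0.04     # false-positive rate implied by the reported accuracy/precision
tp = recall * frauds   # ~448 true positives
fp = fpr * legit       # ~11,400 false positives
print(tp / (tp + fp))  # ~0.038, which matches the reported precision

With only 492 positives among 284,807 rows, even a ~4% false-positive rate produces on the order of 11,000 false alarms, which swamps the ~450 true positives.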

Code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

def model_report(y_test, pred):
    print("Accuracy:\t", accuracy_score(y_test, pred))
    print("Precision:\t", precision_score(y_test, pred))
    print("RECALL:\t\t", recall_score(y_test, pred))
    print("F1 Score:\t", f1_score(y_test, pred))

df = pd.read_csv("data/creditcard.csv")
target = 'Class'
X = df.drop(columns=target)
y = df[target]  # a Series, so sklearn's fit() doesn't warn about a column-vector y
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print("WITHOUT UNDERSAMPLING:")
clf = LogisticRegression().fit(x_train, y_train)
pred = clf.predict(x_test)
model_report(y_test, pred)

# Build the undersampled DataFrame: all 492 fraud rows plus 492 randomly
# sampled non-fraud rows. No seed is set here, so the sample (and the
# scores below) change on every run.
minority_class_len = len(df[df[target] == 1])
minority_class_indices = df[df[target] == 1].index
majority_class_indices = df[df[target] == 0].index
random_majority_indices = np.random.choice(majority_class_indices, minority_class_len, replace=False)
undersample_indices = np.concatenate([minority_class_indices, random_majority_indices])
undersample = df.loc[undersample_indices]

X_undersample = undersample.drop(columns=target)
y_undersample = undersample[target]  # again a Series rather than a one-column DataFrame
x_train, x_test, y_train, y_test = train_test_split(X_undersample, y_undersample, test_size=0.33, random_state=42)

print("\nWITH UNDERSAMPLING:")
clf = LogisticRegression().fit(x_train, y_train)
pred = clf.predict(x_test)
model_report(y_test, pred)

print("\nWITH UNDERSAMPLING & TESTING ON ENIRE DATASET:")
pred = clf.predict(X)
model_report(y, pred)

Output:

WITHOUT UNDERSAMPLING:
Accuracy:        0.9989679423750093
Precision:       0.7241379310344828
RECALL:          0.5637583892617449
F1 Score:        0.6339622641509434

WITH UNDERSAMPLING:
Accuracy:        0.9353846153846154
Precision:       0.9673202614379085
RECALL:          0.9024390243902439
F1 Score:        0.9337539432176657

WITH UNDERSAMPLING & TESTING ON ENTIRE DATASET:
Accuracy:        0.9595936897618387
Precision:       0.03760913364674278
RECALL:          0.9105691056910569
F1 Score:        0.07223476297968398
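
If it helps, the raw counts behind these numbers can be inspected with a confusion matrix (a quick sketch that reuses the variables from the code above):

from sklearn.metrics import confusion_matrix
# Rows are the true classes (0, 1), columns the predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y, clf.predict(X)))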
  • When you say 'testing on entire dataset' do you mean trained on your subset and then evaluated on the whole dataset? It looks suspicious that in your undersampled results you have exactly the same precision and recall. Are you sure there's not an error there? – Robert King Mar 27 '20 at 17:34
  • 1. Yes, I trained the classifier on my subset and then evaluated it on the whole dataset for the last portion of the code. 2. Since I take 492 random values for the Class=0 data, the data changes every time I run it and the scores differ a little bit. I did find it strange that they were equal, but I ran it again multiple times and it appears to have been an anomaly, since it has not repeated. – Anuj S Mar 27 '20 at 17:43
  • If you think some results are an anomaly (and you cannot reproduce them), please **remove** them from the question altogether (instead of cluttering the issue further with clarifications) - they just create unnecessary noise and don't help focus on the real issue/question. – desertnaut Mar 27 '20 at 18:01
