I am working on a Medicare fraud detection model. The data is extremely imbalanced: 14 fraudulent (positive) cases against approximately 1 million non-fraudulent cases. I initially had 8 features, but after one-hot encoding my categorical variables I have 103 features (mostly because there are 94 unique provider types). I created a pipeline that combines a logistic regression classifier with SMOTE.
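For context, the encoding step was essentially this (a sketch; PartB and the 'Provider_Type' column name are stand-ins for my actual frame and categorical columns):

import pandas as pd

# One-hot encode the categorical columns; the provider-type column alone
# expands into 94 indicator columns, which accounts for most of the 103 features
PartB_encoded = pd.get_dummies(PartB, columns=['Provider_Type'])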
##########
# Use pipeline - combination of SMOTE and logistic regression model
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline, so SMOTE only runs during fit
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             classification_report, confusion_matrix)

# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE(random_state=27, sampling_strategy='minority')
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])

# Split the data X and y into a training and a test set, fit the pipeline
# on the training data, and predict on the held-out test set
y = PartB_encoded['Is_fraud']
X = PartB_encoded.drop(['Is_fraud'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)

print("Accuracy score: ", accuracy_score(y_true=y_test, y_pred=predicted))
print("Precision score: ", precision_score(y_true=y_test, y_pred=predicted))
print("Recall score: ", recall_score(y_true=y_test, y_pred=predicted))

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
This was my output:
Accuracy score: 0.9333130935552119
Precision score: 2.3716352424997034e-05
Recall score: 0.09090909090909091
Classification report:
               precision    recall  f1-score   support

       False       1.00      0.93      0.97    632407
        True       0.00      0.09      0.00        11

    accuracy                           0.93    632418
   macro avg       0.50      0.51      0.48    632418
weighted avg       1.00      0.93      0.97    632418

Confusion matrix:
 [[590243  42164]
 [    10      1]]
Obviously my recall and precision are unacceptably low. The confusion matrix makes the problem concrete: the model flags 42,164 legitimate claims as fraud while catching only 1 of the 11 fraud cases in the test set, so precision is 1/(1 + 42,164) ≈ 2.4e-05. How do I increase my recall and precision? I am considering undersampling, but I am afraid of cutting out too much data if I shrink my negative class from roughly 1 million records down to 14 to match my positive class. I am also considering removing features, but I am unsure how to determine which ones to remove; a sketch of both ideas follows.
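To make the two options concrete, this is roughly what I have in mind (a minimal sketch; the step names and the sampling_strategy=0.01 and C=0.1 values are placeholders I made up, not tuned choices). RandomUnderSampler with a float ratio keeps about 100 non-fraud rows per fraud row rather than cutting the negatives all the way down to 14, and SelectFromModel with an L1-penalized logistic regression drops features whose coefficients shrink to zero:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Undersample inside the pipeline so it only touches the training data.
# sampling_strategy=0.01 keeps ~100 majority rows per minority row, instead of
# the 1:1 ratio (sampling_strategy=1.0) that would leave only 14 negatives.
undersample = RandomUnderSampler(random_state=27, sampling_strategy=0.01)
# L1 regularization zeroes out coefficients of uninformative features;
# SelectFromModel keeps only the columns with non-zero weights.
selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('undersample', undersample), ('select', selector), ('logreg', model)])
pipeline.fit(X_train, y_train)
# Inspect which of the 103 features survived the L1 selection step
kept = X_train.columns[pipeline.named_steps['select'].get_support()]
print(len(kept), 'features kept')

Is a milder undersampling ratio plus this kind of embedded feature selection a reasonable direction, or is there a better way to choose the ratio and the features?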