
I am working on a Medicare fraud detection model. The data is extremely imbalanced, with 14 fraudulent (positive) cases and approximately 1 million non-fraudulent cases. I initially had 8 features, but after one-hot encoding my categorical variables I have 103 features (due to having 94 unique provider types). I created a pipeline that combines a Logistic Regression classifier with SMOTE.

##########
# Use a pipeline - combination of SMOTE and a logistic regression model.
# Note: SMOTE must go in imblearn's Pipeline (not sklearn's) so that
# resampling is applied only to the training data inside the pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE(random_state=27, sampling_strategy="minority")
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])

# Split X and y into a training and a test set and fit the pipeline on the training data
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             classification_report, confusion_matrix)

y = PartB_encoded['Is_fraud']
X = PartB_encoded.drop(['Is_fraud'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
print("Accuracy score: ", accuracy_score(y_true=y_test, y_pred=predicted))
print("Precision score: ", precision_score(y_true=y_test, y_pred=predicted))
print("Recall score: ", recall_score(y_true=y_test, y_pred=predicted))

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)

This was my output:

Accuracy score:  0.9333130935552119
Precision score:  2.3716352424997034e-05
Recall score:  0.09090909090909091
Classification report:
               precision    recall  f1-score   support

       False       1.00      0.93      0.97    632407
        True       0.00      0.09      0.00        11

    accuracy                           0.93    632418
   macro avg       0.50      0.51      0.48    632418
weighted avg       1.00      0.93      0.97    632418

Confusion matrix:
 [[590243  42164]
 [    10      1]]

Obviously my recall and precision are extremely low, which is not acceptable. How do I increase them? I am considering undersampling, but I am afraid of cutting out too much data if I shrink my negative class from roughly 1 million records down to 14 to match my positive class. I am also considering removing features, but I am unsure how to determine which ones to remove.

asked by Alyssa
  • Although there are some techniques to use when the dataset is unbalanced, I don't think they will work in your case. 1 million vs 14: not only is the data unbalanced, but 14 positives is far too few to learn from. You must gather (or even simulate yourself) more fraudulent data. – Wazaki Aug 10 '20 at 06:19
  • Indeed, as @Wazaki said; please keep in mind that ML is not magic. – desertnaut Aug 10 '20 at 11:10

1 Answer


We faced a similar issue in financial fraud detection, where actual fraud is generally less than 0.1% of the data. You have to undersample the predominant class while taking care that the representation of its various inner sub-populations remains intact. So first perform a clustering on your predominant population, then select from each cluster to create a trimmed-down sample of the predominant class. Try ratios like 80:20, 90:10, etc. until you achieve respectable precision and recall. Oversampling techniques like SMOTE are not really advisable here, as synthetically prepared data will diverge from real data in most cases.
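The cluster-then-sample step described above could be sketched roughly like this (a minimal illustration with toy arrays; the use of KMeans, the 90:10 target ratio, and all variable names are assumptions, not part of the answer):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(27)

# Toy stand-ins for the majority (non-fraud) and minority (fraud) classes.
X_major = rng.normal(size=(10000, 5))
X_minor = rng.normal(loc=3.0, size=(14, 5))

# 1. Cluster the majority class to capture its inner structure.
n_clusters = 10
km = KMeans(n_clusters=n_clusters, random_state=27, n_init=10)
labels = km.fit_predict(X_major)

# 2. Sample from each cluster proportionally to its size, so every
#    sub-population stays represented, targeting e.g. a 90:10 ratio.
target_major = len(X_minor) * 9  # 90:10 majority:minority
picked = []
for c in range(n_clusters):
    idx = np.flatnonzero(labels == c)
    k = max(1, round(target_major * len(idx) / len(X_major)))
    picked.append(rng.choice(idx, size=min(k, len(idx)), replace=False))
idx_sampled = np.concatenate(picked)

# 3. Combine the trimmed-down majority sample with all minority cases.
X_train_bal = np.vstack([X_major[idx_sampled], X_minor])
y_train_bal = np.concatenate([np.zeros(len(idx_sampled)),
                              np.ones(len(X_minor))])
```

With real data you would fit the classifier on `X_train_bal`/`y_train_bal` and sweep the target ratio (80:20, 90:10, ...) while watching precision and recall on a held-out, *unresampled* test set.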

answered by dipanjanb