I am working on a Medicare fraud detection model. The data is extremely imbalanced: 14 fraudulent (positive) cases against approximately 1 million non-fraudulent cases. I initially had 8 features, but after one-hot encoding my categorical variables I have 103 features (mostly because there are 94 unique provider types). I created a pipeline that combines a logistic regression classifier with SMOTE.
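For context, the encoding step was essentially this (a sketch; PartB and the 'Provider_Type' column name are stand-ins for my actual frame and categorical columns):

import pandas as pd

# One-hot encode the categorical columns; the provider-type column alone
# expands into 94 indicator columns, which accounts for most of the 103 features
PartB_encoded = pd.get_dummies(PartB, columns=['Provider_Type'])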
##########
# Use pipeline - combination of SMOTE and logistic regression model
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline, so SMOTE only runs during fit
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             classification_report, confusion_matrix)

# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE(random_state=27, sampling_strategy='minority')
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])

# Split the data X and y into a training and a test set, fit the pipeline
# on the training data, and predict on the held-out test set
y = PartB_encoded['Is_fraud']
X = PartB_encoded.drop(['Is_fraud'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)

print("Accuracy score: ", accuracy_score(y_true=y_test, y_pred=predicted))
print("Precision score: ", precision_score(y_true=y_test, y_pred=predicted))
print("Recall score: ", recall_score(y_true=y_test, y_pred=predicted))

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
This was my output:
Accuracy score: 0.9333130935552119
Precision score: 2.3716352424997034e-05
Recall score: 0.09090909090909091
Classification report:
               precision    recall  f1-score   support

       False       1.00      0.93      0.97    632407
        True       0.00      0.09      0.00        11

    accuracy                           0.93    632418
   macro avg       0.50      0.51      0.48    632418
weighted avg       1.00      0.93      0.97    632418

Confusion matrix:
 [[590243  42164]
 [    10      1]]
Obviously my recall and precision are unacceptably low. The confusion matrix makes the problem concrete: the model flags 42,164 legitimate claims as fraud while catching only 1 of the 11 fraud cases in the test set, so precision is 1/(1 + 42,164) ≈ 2.4e-05. How do I increase my recall and precision? I am considering undersampling, but I am afraid of cutting out too much data if I shrink my negative class from roughly 1 million records down to 14 to match my positive class. I am also considering removing features, but I am unsure how to determine which ones to remove; a sketch of both ideas follows.
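To make the two options concrete, this is roughly what I have in mind (a minimal sketch; the step names and the sampling_strategy=0.01 and C=0.1 values are placeholders I made up, not tuned choices). RandomUnderSampler with a float ratio keeps about 100 non-fraud rows per fraud row rather than cutting the negatives all the way down to 14, and SelectFromModel with an L1-penalized logistic regression drops features whose coefficients shrink to zero:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Undersample inside the pipeline so it only touches the training data.
# sampling_strategy=0.01 keeps ~100 majority rows per minority row, instead of
# the 1:1 ratio (sampling_strategy=1.0) that would leave only 14 negatives.
undersample = RandomUnderSampler(random_state=27, sampling_strategy=0.01)
# L1 regularization zeroes out coefficients of uninformative features;
# SelectFromModel keeps only the columns with non-zero weights.
selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('undersample', undersample), ('select', selector), ('logreg', model)])
pipeline.fit(X_train, y_train)
# Inspect which of the 103 features survived the L1 selection step
kept = X_train.columns[pipeline.named_steps['select'].get_support()]
print(len(kept), 'features kept')

Is a milder undersampling ratio plus this kind of embedded feature selection a reasonable direction, or is there a better way to choose the ratio and the features?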