I am using the credit card fraud dataset (link below), which is highly imbalanced: the positive class has only 492 instances and the negative class has 284315.
I am applying PU Bagging (link below) to extract hidden positives from the negative class, i.e. negative instances whose feature values resemble those of positive instances. For the max_samples hyperparameter I first passed sum(y), which worked. Then, purely as a test, I set max_samples to 1000 to see whether it would raise an error, but it did not. As I understand it, max_samples=1000 means it should take 1000 samples from both classes, yet no error was raised. I also tested values below 492, such as 30, and it still worked. Setting bootstrap and oob_score to False made no difference either. Finally, I tried passing max_samples as a list, [492,492], but a list is not accepted.
I want the classifier to take 492 samples from each class, as in [492,492], but I don't know whether it is doing that or not.
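To make concrete what I mean by [492,492], here is a minimal NumPy sketch of the sampling I have in mind. The helper name balanced_bootstrap_indices and the toy labels are hypothetical, made up for illustration. This is not taken from baggingPU.py and I am not claiming that BaggingClassifierPU samples this way; it only shows the behaviour I would like to confirm or rule out.

```python
import numpy as np

def balanced_bootstrap_indices(y, n_per_class, rng):
    """Draw n_per_class row indices from each class of a binary (0/1) label array."""
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Sample with replacement only when a class has fewer rows than requested.
    pos_draw = rng.choice(pos_idx, size=n_per_class,
                          replace=len(pos_idx) < n_per_class)
    neg_draw = rng.choice(neg_idx, size=n_per_class,
                          replace=len(neg_idx) < n_per_class)
    return np.concatenate([pos_draw, neg_draw])

# Toy labels with a similar kind of imbalance (hypothetical, not the real dataset)
rng = np.random.default_rng(0)
y_toy = np.array([1] * 50 + [0] * 5000)
idx = balanced_bootstrap_indices(y_toy, n_per_class=50, rng=rng)

# Class composition of the drawn subsample: index 0 is class 0, index 1 is class 1
print(np.bincount(y_toy[idx]))
```

With the real data, the equivalent call would use y_bank and n_per_class=492. Printing the class counts of the drawn indices is the kind of check I would like to be able to run against whatever each base estimator actually receives.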
Link for the dataset: https://machinelearningmastery.com/standard-machine-learning-datasets-for-imbalanced-classification/
Link for pu_bagging code: https://github.com/roywright/pu_learning/blob/master/baggingPU.py
My code is:
# Importing and preprocessing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
# Load the data and split off the label column
df_bank = pd.read_csv('testingcreditcard.csv')
y_bank = df_bank['labels']
df_bank.drop(['labels'], axis=1, inplace=True)
# Count the instances in each class
unique, counts = np.unique(y_bank, return_counts=True)
dict(zip(unique, counts))
# PU bagging
from sklearn.ensemble import RandomForestClassifier
from baggingPU import BaggingClassifierPU
bc = BaggingClassifierPU(RandomForestClassifier(), n_estimators=200, n_jobs=-1, max_samples=30)
bc.fit(df_bank, y_bank)
# Predictions (made on the same data the model was fitted on)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
rpredd = bc.predict(df_bank)
print(confusion_matrix(y_bank,rpredd))
print(accuracy_score(y_bank,rpredd))
print(classification_report(y_bank,rpredd))