I am using the credit card fraud dataset (link below), which is highly imbalanced: the positive class has only 492 instances, while the negative class has 284315.

I was applying PU bagging (link below) to extract hidden positives from the negative class, i.e. negative instances whose properties/values resemble those of the positive instances. For the max_samples hyperparameter I was passing sum(y), which worked. Just for testing purposes I set max_samples=1000 to check whether it would give an error, but it did not, even though taking 1000 samples from both classes should be impossible with only 492 positives. I also tested values smaller than 492, such as 30, and it still worked, and it still raised no error with bootstrap and oob_score set to False. Finally, I tried passing max_samples as a list like [492, 492], but it does not accept a list.

I want the classifier to take 492 samples from each class, as in [492, 492], but I don't know whether it is doing that or not (one way to check is sketched after my code below).

Link for the dataset: https://machinelearningmastery.com/standard-machine-learning-datasets-for-imbalanced-classification/

Link for pu_bagging code: https://github.com/roywright/pu_learning/blob/master/baggingPU.py

My code is:

#importing and preprocessing

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
df_bank=pd.read_csv('testingcreditcard.csv')
y_bank=df_bank['labels']
df_bank.drop(['labels'],axis=1,inplace=True)

#counter
unique, counts = np.unique(y_bank, return_counts=True)
dict(zip(unique, counts))

#Pu_bagging
from sklearn.ensemble import RandomForestClassifier
from baggingPU import BaggingClassifierPU
bc = BaggingClassifierPU(RandomForestClassifier(), n_estimators=200, n_jobs=-1, max_samples=30)
bc.fit(df_bank, y_bank)

#Predictions
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
rpredd=bc.predict(df_bank)
print(confusion_matrix(y_bank,rpredd))
print(accuracy_score(y_bank,rpredd))
print(classification_report(y_bank,rpredd))
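
One way to check what actually lands in each bag is to inspect the fitted ensemble. This is a sketch assuming that BaggingClassifierPU, being a fork of sklearn's BaggingClassifier, keeps its estimators_samples_ attribute; depending on the sklearn version the fork was based on, it holds either index arrays or boolean masks, so the sketch handles both:

#inspecting bag composition after fitting
import numpy as np
y_arr = np.asarray(y_bank)
for i, sample in enumerate(bc.estimators_samples_[:5]):   #first 5 bags
    sample = np.asarray(sample)
    #older sklearn code returns boolean masks, newer code index arrays
    idx = np.flatnonzero(sample) if sample.dtype == bool else sample
    labels = y_arr[idx]
    print('bag', i, ':', len(idx), 'rows,',
          int((labels == 1).sum()), 'positive,',
          int((labels == 0).sum()), 'negative/unlabeled')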
  • `max_samples` does not do what you seem to think it does; please see the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). – desertnaut Mar 15 '21 at 09:29
  • Thanks sir, I know it means the number of samples to draw from the dataset, but this is different from typical bagging; please read the authors' opening comments pasted below or at (https://github.com/roywright/pu_learning/blob/master/baggingPU.py), and also his article at (https://roywrightme.wordpress.com/2017/11/16/positive-unlabeled-learning/). Comments = "for a PU problem with 500 positives and 10000 unlabeled, we might set max_samples = [500, 500] (to balance P and U in each bag) and bootstrap = [True, False] (to only bootstrap the unlabeled)." Please tell me if I am missing something major. (A hand-rolled version of this balanced-bag scheme is sketched after these comments.) – a.ydv Mar 15 '21 at 10:32
  • Sorry, my bad – I misread the code, thinking that `max_samples` here was a parameter of the `RandomForestClassifier()`, but it is not. – desertnaut Mar 15 '21 at 10:34
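
For reference, the balanced-bag scheme the author describes (every bag gets all positives plus an equal-sized bootstrap sample of the unlabeled, and each unlabeled point is scored by the bags that left it out) can be hand-rolled in a few lines. This is a sketch of the technique from the linked article, not of the baggingPU API:

#hand-rolled PU bagging: all 492 positives + 492 bootstrapped unlabeled per bag
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = df_bank.values
y = y_bank.values
pos_idx = np.where(y == 1)[0]
unl_idx = np.where(y == 0)[0]

oob_sum = np.zeros(len(unl_idx))
oob_cnt = np.zeros(len(unl_idx))
rng = np.random.RandomState(42)

for _ in range(100):                       #number of bags
    #bootstrap the unlabeled set down to the size of the positive set
    boot = rng.choice(unl_idx, size=len(pos_idx), replace=True)
    train_idx = np.concatenate([pos_idx, boot])
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    #accumulate out-of-bag scores for the unlabeled rows not in this bag
    oob = ~np.isin(unl_idx, boot)
    oob_sum[oob] += clf.predict_proba(X[unl_idx[oob]])[:, 1]
    oob_cnt[oob] += 1

scores = oob_sum / np.maximum(oob_cnt, 1)  #high score = likely hidden positive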

1 Answer

Digging into the code a bit, I found this section:

if bootstrap:
    #with replacement: n_samples may exceed n_population without error
    indices = random_state.randint(0, n_population, n_samples)
else:
    #without replacement: raises ValueError if n_samples > n_population
    indices = sample_without_replacement(n_population, n_samples,
                                         random_state=random_state)

When initializing, bootstrap defaults to True, which means the indices are drawn with replacement via random_state.randint. Sampling with replacement never complains about the requested count: even if you specify 1000, it can produce 1000 samples from the 492 positives by repeating some of them, and any value below 492 works for the same reason. That is why no value of max_samples raises an error.
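
You can see the difference between the two branches directly. This is a minimal sketch using numpy and sklearn's sample_without_replacement utility, with 492 and 1000 standing in for the population size and the requested count:

#with replacement: any requested count is legal
import numpy as np
from sklearn.utils.random import sample_without_replacement

rng = np.random.RandomState(42)
with_repl = rng.randint(0, 492, 1000)
print(len(with_repl), len(np.unique(with_repl)))  #1000 draws, at most 492 unique

#without replacement: asking for more than the population raises
try:
    sample_without_replacement(492, 1000, random_state=rng)
except ValueError as e:
    print(e)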