I applied sklearn's DecisionTreeClassifier() to a custom dataset to perform binary classification (class 0 and class 1). Initially the classes were not balanced, so I tried to balance them using:
from imblearn.under_sampling import RandomUnderSampler

# undersample the majority class (with replacement) to balance the two classes
rus = RandomUnderSampler(random_state=42, replacement=True)
data_rus, target_rus = rus.fit_resample(X, y)
My dataset was then balanced, with 186404 samples for class 0 and 186404 samples for class 1. There were 260965 training samples and 111843 testing samples.
I calculated the accuracy using sklearn.metrics and got the following result:
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(data_rus, target_rus)  # fit on the full resampled dataset
accuracy_score(y_test, clf.predict(X_test))  # I got 100% for both training and testing
clf.score(X_test, y_test)  # I got 100% for both training and testing
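For reference, a minimal sketch of how I checked both accuracies; it assumes X_train and y_train come from the train_test_split described in the update below.

from sklearn.metrics import accuracy_score

print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))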
So I got 100% accuracy for both the training and testing phases. I am sure this result is abnormal, but I could not tell whether it is overfitting or data leakage, even though I had shuffled my data before splitting it. I then decided to plot both training and testing accuracy using
sklearn.model_selection.validation_curve
I got the following figure and could not interpret it:
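This is roughly how I produced the curve; the swept parameter (max_depth) and its range are assumptions here, not necessarily what matters for my data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import validation_curve

param_range = np.arange(1, 21)  # hypothetical depth range
train_scores, test_scores = validation_curve(
    tree.DecisionTreeClassifier(criterion="entropy", random_state=0),
    data_rus, target_rus,
    param_name="max_depth",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

# average the per-fold scores before plotting
plt.plot(param_range, train_scores.mean(axis=1), label="training accuracy")
plt.plot(param_range, test_scores.mean(axis=1), label="cross-validation accuracy")
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.legend()
plt.show()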
I tried two other classification algorithms, Logistic Regression and SVM, and got testing accuracies of 99.84% and 99.94%, respectively.
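A rough sketch of those comparison runs; I used the default settings, so the solver and kernel choices shown here are assumptions rather than tuned parameters.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# fit each model on the training split and score it on the held-out test split
for model in (LogisticRegression(max_iter=1000), SVC()):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))

(SVC is slow on ~260k samples; LinearSVC would be a faster stand-in for a linear kernel.)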
Update: my original dataset has 4 categorical columns, which I mapped using the following code:
# encode the 'Color' column as integer category codes
DataFrame['Color'] = pd.Categorical(DataFrame['Color'])
DataFrame['code_Color'] = DataFrame.Color.cat.codes
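I applied the same encoding to all four categorical columns, roughly like this; the column names other than 'Color' are placeholders, since I have not listed the real ones.

import pandas as pd

categorical_cols = ['Color', 'Cat2', 'Cat3', 'Cat4']  # hypothetical names
for col in categorical_cols:
    DataFrame[col] = pd.Categorical(DataFrame[col])
    DataFrame['code_' + col] = DataFrame[col].cat.codes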
After using RandomUnderSampler to undersample my original data and balance the classes, I split the data into train and test sets with sklearn's train_test_split.
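A minimal sketch of that split step, assuming test_size=0.3, which matches the reported 260965 / 111843 sample counts:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data_rus, target_rus, test_size=0.3, random_state=42, shuffle=True
)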
Any ideas would be very helpful, please!