
I have applied sklearn's DecisionTreeClassifier() on a custom dataset to perform binary classification (class 0 and class 1).

Initially the classes were not balanced, so I tried to balance them using:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42, replacement=True)
data_rus, target_rus = rus.fit_resample(X, y)

So my dataset was balanced, with 186404 samples for class 0 and 186404 samples for class 1. There were 260965 training samples and 111843 testing samples. I calculated the accuracy using sklearn.metrics and got the following result:

from sklearn import tree
from sklearn.metrics import accuracy_score

clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(data_rus, target_rus)
accuracy_score(y_test, clf.predict(X_test))  # I got 100% for both training and testing
clf.score(X_test, y_test)                    # I got 100% for both training and testing

So I got 100% accuracy for both the training and the testing phase. I am sure this result is abnormal, but I could not tell whether it is overfitting or data leakage, even though I had shuffled my data before splitting it. I then decided to plot both training and testing accuracy using

sklearn.model_selection.validation_curve
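A minimal sketch of such a call (assuming max_depth is the parameter being varied, as the comments below suggest; the depth range, cv=5, and the X_train, y_train names are assumptions, not my exact values):

import numpy as np
from sklearn import tree
from sklearn.model_selection import validation_curve

# Compute cross-validated training and validation accuracy over a range of tree depths.
# X_train, y_train refer to the training split described further down.
depths = np.arange(1, 21)
train_scores, valid_scores = validation_curve(
    tree.DecisionTreeClassifier(criterion="entropy", random_state=0),
    X_train, y_train,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy"
)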

I got the following figure, which I could not interpret:

[figure: validation curve of training and testing accuracy]

I tried two other classification algorithms, Logistic Regression and SVM, and got testing accuracies of 99.84% and 99.94%, respectively.
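For completeness, a rough sketch of that comparison (default hyperparameters; the exact SVM variant I used is not shown here, LinearSVC is just an assumption for speed on a dataset this size):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Fit both models on the training split and score them on the test split.
for model in (LogisticRegression(max_iter=1000), LinearSVC()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))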

Update: In my original dataset I have 4 categorical columns, which I mapped using the following code:

import pandas as pd
# Encode the 'Color' column as integer category codes
DataFrame['Color'] = pd.Categorical(DataFrame['Color'])
DataFrame['code_Color'] = DataFrame.Color.cat.codes
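The same pattern applied to all four categorical columns might look like the sketch below (only Color appears above; the other column names are placeholders):

# 'Color' is from the snippet above; the other three column names are hypothetical placeholders.
for col in ['Color', 'Cat_Col_2', 'Cat_Col_3', 'Cat_Col_4']:
    DataFrame[col] = pd.Categorical(DataFrame[col])
    DataFrame['code_' + col] = DataFrame[col].cat.codes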

After using RandomUnderSampler to undersample my original data and balance the classes, I split the data into train and test sets with sklearn's train_test_split, roughly as in the sketch below.
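A minimal sketch of that split (the 30% test fraction is mentioned in the comments; the random_state value here is just an assumption for reproducibility):

from sklearn.model_selection import train_test_split

# Split the undersampled data into train and test sets (30% test, shuffled by default).
X_train, X_test, y_train, y_test = train_test_split(
    data_rus, target_rus, test_size=0.3, random_state=42
)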

Any idea would be helpful, please!

baddy
  • It really depends on the data... for a simple case it may well be that you get such scores. Another hint: a 100% score might suggest that you included the labels column in the data; it happened to me a couple of times and I had a hard time discovering it. A high max_depth means a complex tree structure, so try to limit it to avoid overfitting. The graph you posted suggests that a value of 11 should be enough. – Gigioz Dec 24 '20 at 18:32
  • If I fix max_depth to 11 I get 99,.. accuracy; do you think that is OK? My training dataset is very large, so to me it seems difficult to build a model that fits that dataset very well. – baddy Dec 24 '20 at 22:53
  • Look... if you could get those high scores with logistic regression and SVM, the problem is pretty linear, so no, it should be easy to build a model. What I wanted to say is that one usually fixes the model structure where the validation score is at its maximum. At some point it will drop while the training score keeps increasing. Maybe 11 is low, but 13 or 14... beyond that you are overfitting the data. – Gigioz Dec 25 '20 at 08:32
  • What scores do you get, for instance, if you use "train_test_split" instead of "RandomUnderSampler"? What happens if you don't specify "random_state"? What do you get with a nearest neighbours classifier or SVM + rbf (non-linear)? – Gigioz Dec 25 '20 at 08:35
  • Is there correlation in the data or some obvious separation in one dimension? Try to plot the data, is my suggestion. – Gigioz Dec 25 '20 at 08:42
  • @Gigioz I think I need to update the question: I use RandomUnderSampler to get class balance because in my original dataset I have 5953541 samples of class 0 and 185444 of class 1; after that I split data_rus, target_rus with train_test_split with 30% testing data. – baddy Dec 25 '20 at 09:34
  • I used the random_state to get the same result each time I run the script. I will try with KNN. – baddy Dec 25 '20 at 09:36
  • @Gigioz For correlation: my original dataset has 21 columns, 4 of which are categorical, so I mapped them manually to numerical codes. – baddy Dec 25 '20 at 09:37
  • mmmh ok, this is a lot of good information! Waiting for updates then. Are you using LabelEncoder for the categorical data? – Gigioz Dec 25 '20 at 13:47
