
Here is my code, and it always returns 100% accuracy regardless of the test size. I used the train_test_split method, so I do not believe there should be any duplicated data between the train and test sets. Could someone inspect my code?

from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


data = pd.read_csv('housing.csv')

prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)

prices.shape    # (20640,)
features.shape  # (20640, 8)


X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_train.shape   # (16512,)
X_train.shape   # (16512, 8)


predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score 
Nick ZH
  • What do you mean by "regardless of how big the test size is"? I doubt that if you set the test size to, say, 90% of the data, the model would still give you 100% accuracy. Moreover, getting 100% accuracy on a simple dataset is not that big a concern. What does concern me is why you are using `DecisionTreeClassifier` instead of `DecisionTreeRegressor` for a house-price prediction problem. Maybe that is your answer. – Akshay Sehgal Nov 23 '20 at 18:06
  • Why `DecisionTreeClassifier()`? Is this not a regression problem? – Prayson W. Daniel Nov 23 '20 at 18:07
  • Please note, this is not overfitting. Overfitting is when your training accuracy is quite high but your validation accuracy is comparatively much lower. This is a sign that your model is fitting very well on your training data but not generalizing on unseen data. – Akshay Sehgal Nov 23 '20 at 18:14
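
For reference, the overfitting check described in the last comment is just a comparison of the model's score on the training split against the held-out split; a minimal sketch, assuming the cleaned `X_train`/`X_test`/`y_train`/`y_test` produced in the answer below:

from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # R^2 on data the model has seen
test_r2 = model.score(X_test, y_test)     # R^2 on held-out data

# An unpruned decision tree typically fits the training split almost
# perfectly (~1.0); a much lower test score is the overfitting signature.
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")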

1 Answer


EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the code below to make sure no bugs are left.

Issues -

  1. You are using `DecisionTreeClassifier` instead of `DecisionTreeRegressor`. Predicting a continuous house price is a regression problem, not classification.
  2. You are removing NaNs after the train/test split, which can leave the features and labels with mismatched sample counts. Do the `data.dropna()` once, before the split. Worse, the line `y_test = X_test.dropna()` overwrites your test labels with the test features.
  3. You are calling `model.score()` incorrectly. It expects `(X_test, y_test)`, but you passed `(y_test, predictions)`, and because of the typo in point 2 your `y_test` is actually `X_test`. The call therefore computes `model.predict(X_test)` internally and compares it with `predictions`, i.e. the model's predictions against themselves, which matches perfectly by construction. That is why you always get 100%. For a regressor, evaluate with `model.score(X_test, y_test)` or `r2_score(y_test, predictions)`; `accuracy_score` is only defined for classification and will raise an error on continuous targets.
from sklearn.tree import DecisionTreeRegressor #<---- FIRST ISSUE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score # accuracy_score only works for classification


data = pd.read_csv('housing.csv')

data = data.dropna() #<--- SECOND ISSUE

prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

model = DecisionTreeRegressor()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
score = r2_score(y_test, predictions) #<----- THIRD ISSUE: R^2, not accuracy
score
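
To see concretely why the original code always reported 100%, here is a minimal sketch of what the question's scoring call actually computed, reusing `model`, `X_test`, `y_test`, and `predictions` from the fixed code above (the `buggy_y_test` name is illustrative, not from the original post):

import numpy as np
from sklearn.metrics import mean_squared_error

# Reproduce the question's typo: `y_test = X_test.dropna()` silently
# replaced the test labels with the test features.
buggy_y_test = X_test

# model.score(X, y) calls model.predict(X) internally and compares it to y.
# Passing (buggy_y_test, predictions) therefore compares model.predict(X_test)
# with predictions -- the model's own output against itself -- which matches
# perfectly every time, hence the constant 100% score.
print(model.score(buggy_y_test, predictions))  # always 1.0

# A sensible report for a regressor instead: R^2 and RMSE on the true labels.
print(model.score(X_test, y_test))                       # R^2 on unseen data
print(np.sqrt(mean_squared_error(y_test, predictions)))  # RMSE in dollars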
Akshay Sehgal