
Here is my code, and it always returns 100% accuracy regardless of the test size. I used the train_test_split method, so I do not believe there should be any duplicated data between the train and test sets. Could someone inspect my code?

from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


data = pd.read_csv('housing.csv')

prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)

prices.shape    # (20640,)
features.shape  # (20640, 8)


X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_train.shape   # (16512,)
X_train.shape   # (16512, 8)


predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score 
Nick ZH
  • What do you mean by "regardless of how big the test size is"? I doubt that if you set the test size to, say, 90% of the data, the model would still give you 100% accuracy. Moreover, getting 100% accuracy on a simple dataset is not that big a concern. What does concern me is why you are using `DecisionTreeClassifier` instead of `DecisionTreeRegressor` for a house-price prediction problem. Maybe that is your answer. – Akshay Sehgal Nov 23 '20 at 18:06
  • Why `DecisionTreeClassifier()`? Is this not a regression problem? – Prayson W. Daniel Nov 23 '20 at 18:07
  • Please note, this is not overfitting. Overfitting is when your training accuracy is quite high but your validation accuracy is comparatively much lower. This is a sign that your model is fitting very well on your training data but not generalizing on unseen data. – Akshay Sehgal Nov 23 '20 at 18:14
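
For reference, the overfitting check described in the last comment is just a comparison of the model's score on the training split against the held-out split; a minimal sketch, assuming the cleaned `X_train`/`X_test`/`y_train`/`y_test` produced in the answer below:

from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # R^2 on data the model has seen
test_r2 = model.score(X_test, y_test)     # R^2 on held-out data

# An unpruned decision tree typically fits the training split almost
# perfectly (~1.0); a much lower test score is the overfitting signature.
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")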

1 Answer


EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the code below to make sure no bugs are left.

Issues -

  1. You are using `DecisionTreeClassifier` instead of `DecisionTreeRegressor`. Predicting a continuous house price is a regression problem, not classification.
  2. You are removing NaNs after the train/test split, which can leave the features and labels with mismatched sample counts. Do the `data.dropna()` once, before the split. Worse, the line `y_test = X_test.dropna()` overwrites your test labels with the test features.
  3. You are calling `model.score()` incorrectly. It expects `(X_test, y_test)`, but you passed `(y_test, predictions)`, and because of the typo in point 2 your `y_test` is actually `X_test`. The call therefore computes `model.predict(X_test)` internally and compares it with `predictions`, i.e. the model's predictions against themselves, which matches perfectly by construction. That is why you always get 100%. For a regressor, evaluate with `model.score(X_test, y_test)` or `r2_score(y_test, predictions)`; `accuracy_score` is only defined for classification and will raise an error on continuous targets.
from sklearn.tree import DecisionTreeRegressor #<---- FIRST ISSUE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score # accuracy_score only works for classification


data = pd.read_csv('housing.csv')

data = data.dropna() #<--- SECOND ISSUE

prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

model = DecisionTreeRegressor()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
score = r2_score(y_test, predictions) #<----- THIRD ISSUE: R^2, not accuracy
score
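
To see concretely why the original code always reported 100%, here is a minimal sketch of what the question's scoring call actually computed, reusing `model`, `X_test`, `y_test`, and `predictions` from the fixed code above (the `buggy_y_test` name is illustrative, not from the original post):

import numpy as np
from sklearn.metrics import mean_squared_error

# Reproduce the question's typo: `y_test = X_test.dropna()` silently
# replaced the test labels with the test features.
buggy_y_test = X_test

# model.score(X, y) calls model.predict(X) internally and compares it to y.
# Passing (buggy_y_test, predictions) therefore compares model.predict(X_test)
# with predictions -- the model's own output against itself -- which matches
# perfectly every time, hence the constant 100% score.
print(model.score(buggy_y_test, predictions))  # always 1.0

# A sensible report for a regressor instead: R^2 and RMSE on the true labels.
print(model.score(X_test, y_test))                       # R^2 on unseen data
print(np.sqrt(mean_squared_error(y_test, predictions)))  # RMSE in dollars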
Akshay Sehgal