I'm a beginner in machine learning and I'm trying to learn by working through Kaggle's Titanic problem. As far as I can tell, the values I pass to the metric are consistent with one another, so I assume the mistake is mine and not Python's. However, I still can't find the source of the error, and the Spyder IDE is no help.

This is my code:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

"""Assigning the train & test datasets' addresses to variables"""
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
test_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\test.csv"

"""Using pandas' read_csv() function to read the datasets
and then assigning them to their own variables"""
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

"""Using pandas' factorize() function to represent genders (male/female)
with binary values (0/1)"""
train_data['Sex'] = pd.factorize(train_data.Sex)[0]
test_data['Sex'] = pd.factorize(test_data.Sex)[0]

"""Replacing missing values in the training and test dataset with 0"""
train_data.fillna(0.0, inplace = True)
test_data.fillna(0.0, inplace = True)

"""Selecting features for training"""
columns_of_interest = ['Pclass', 'Sex', 'Age']

"""Dropping missing/NaN values from the training dataset"""
filtered_titanic_data = train_data.dropna(axis=0)

"""Using the predictor features in the data to handle the x axis"""
x = filtered_titanic_data[columns_of_interest]

"""The survival (what we're trying to find) is the y axis"""
y = filtered_titanic_data.Survived

"""Splitting the train data with test"""
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

"""Assigning the DecisionTreeRegressor model to a variable"""
titanic_model = DecisionTreeRegressor()

"""Fitting the x and y values with the model"""
titanic_model.fit(train_x, train_y)

"""Predicting the x-axis"""
val_predictions = titanic_model.predict(val_x)

"""Assigning the feature columns from the test to a variable"""
test_x = test_data[columns_of_interest]

"""Predicting the test by feeding its x axis into the model"""
test_predictions = titanic_model.predict(test_x)

"""Printing the prediction"""
print(val_predictions)

"""Checking for the accuracy"""
print(accuracy_score(val_y, val_predictions))

"""Printing the test prediction"""
print(test_predictions)

and this is the stacktrace:

Traceback (most recent call last):

  File "<ipython-input-3-73797c87986e>", line 1, in <module>
    runfile('C:/Users/Omar/Downloads/Kaggle Competition/Titanic.py', wdir='C:/Users/Omar/Downloads/Kaggle Competition')

  File "C:\Users\Omar\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\Users\Omar\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/Omar/Downloads/Kaggle Competition/Titanic.py", line 58, in <module>
    print(accuracy_score(val_y, val_predictions))

  File "C:\Users\Omar\Anaconda3\lib\site-packages\sklearn\metrics\classification.py", line 176, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)

  File "C:\Users\Omar\Anaconda3\lib\site-packages\sklearn\metrics\classification.py", line 81, in _check_targets
    "and {1} targets".format(type_true, type_pred))

ValueError: Classification metrics can't handle a mix of binary and continuous targets
desertnaut
Onur-Andros Ozbek
  • Since the Titanic data are easily available, this is a good opportunity to make your code fully [reproducible](https://stackoverflow.com/help/mcve); next time, it is good practice to explicitly include your (relevant) imports (instead of `import ...`)... – desertnaut Sep 13 '18 at 16:47
  • Thanks. I've added the imports. – Onur-Andros Ozbek Sep 13 '18 at 16:50
  • Please, don't edit the post to include the remedy & the new result without the error! It is supposed to stay as it is for possible help of others in the future (edited it back myself)! – desertnaut Sep 13 '18 at 17:03

3 Answers

You are using a DecisionTreeRegressor, which, as the name says, is a regression model. The Kaggle Titanic problem is a classification problem, so you should use a DecisionTreeClassifier.

As for why your code is throwing an error: val_y holds binary values (0, 1), whereas val_predictions holds continuous values because you used a regressor model.
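
To see the mismatch concretely, here is a minimal sketch on made-up data (the arrays below are hypothetical stand-ins, not the Titanic set). The first two rows share the same features but carry different labels, so the regressor's leaf can only return their mean, 0.5, and accuracy_score refuses the comparison. The classifier returns a class label for the same rows.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical toy data: two feature columns, binary labels.
# Rows 0 and 1 have identical features but different labels.
X = np.array([[3, 0], [3, 0], [1, 1], [1, 1], [2, 1], [2, 0]])
y = np.array([0, 1, 1, 1, 1, 0])

# The regressor predicts the mean label of each leaf, which need not be 0 or 1.
reg_pred = DecisionTreeRegressor(random_state=0).fit(X, y).predict(X)
print(reg_pred)

# accuracy_score rejects the binary/continuous mix with a ValueError.
try:
    accuracy_score(y, reg_pred)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# The classifier predicts one of the class labels, so the metric works.
clf_pred = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
clf_accuracy = accuracy_score(y, clf_pred)
print(clf_pred, clf_accuracy)
```

Note that a regressor would only slip past accuracy_score if every prediction happened to be exactly 0.0 or 1.0, which is not something to rely on.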

Raunaq Jain

You are trying to use a regression algorithm (DecisionTreeRegressor) for a binary classification problem. The regression model, as expected, gives continuous outputs, but accuracy_score, which is where the error actually happens:

File "C:/Users/Omar/Downloads/Kaggle Competition/Titanic.py", line 58, in <module>
    print(accuracy_score(val_y, val_predictions)) 

expects binary ones, hence the error.

For starters, change your model to

from sklearn.tree import DecisionTreeClassifier

titanic_model = DecisionTreeClassifier()
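
Here is a sketch of the question's pipeline with that one change applied. Since the CSV files are not available here, it assumes a randomly generated DataFrame with the same column names and a made-up rule for Survived; both are purely illustrative. Apart from that stand-in data and a fixed random_state, only the model line differs from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for train.csv, reusing the question's column names.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Age": rng.integers(1, 80, size=n),
})
# Made-up labelling rule, only so the model has something to learn.
data["Survived"] = ((data["Sex"] == 1) | (data["Pclass"] == 1)).astype(int)

columns_of_interest = ["Pclass", "Sex", "Age"]
x = data[columns_of_interest]
y = data["Survived"]
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

# The one change: a classifier instead of a regressor.
titanic_model = DecisionTreeClassifier(random_state=0)
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)

# Both arguments are now discrete class labels, so the metric is well defined.
val_accuracy = accuracy_score(val_y, val_predictions)
print(val_accuracy)
```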
desertnaut

Classification needs discrete labels, because it predicts a class (which is one of those labels), whereas regression works with continuous data. As your output is a class label, you need to perform classification.
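
A small sketch of how scikit-learn draws that line: the traceback shows _check_targets comparing a "type" for the true and predicted values, and the public helper type_of_target in sklearn.utils.multiclass reports the same kind of label. The input lists below are made up for illustration.

```python
from sklearn.utils.multiclass import type_of_target

# Labels such as Survived: only two distinct integer values.
label_type = type_of_target([0, 1, 1, 0])

# Regressor-style outputs: non-integer floats.
prediction_type = type_of_target([0.0, 0.5, 1.0, 0.25])

print(label_type, prediction_type)  # binary continuous
```

When the two types differ like this, a classification metric has no sensible way to compare them.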

S.Dasgupta