
I'm a data science noob and am working on the Kaggle Titanic dataset. I'm running a Logistic Regression on it to predict whether passengers in the test data set survived or died.

I clean both the training and test data and run the Logistic Regression fit on the training data. All good.

import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv('train.csv')
X_train = train.drop('Survived', axis=1)
y_train = train['Survived']

logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)

Then I run the prediction model on the test data as such:

test = pd.read_csv('test.csv')
predictions = logmodel.predict(test)

I then try to print the Confusion Matrix:

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(test,predictions))

I get an error that says:

ValueError: Classification metrics can't handle a mix of continuous-multioutput and binary targets

What does this mean and how do I fix it?

Some potential issues I see are:

  1. I'm doing something super dumb and wrong with that prediction model on the test data.
  2. The values for the features "Age" and "Fare" (the cost of a passenger's ticket) are floats, while the rest are integers.

Where am I going wrong? Thanks for your help!

desertnaut
Mike Chan
  • Check `confusion_matrix` arguments: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html, you are supposed to pass two arrays, not the whole test dataset. – m-dz Nov 08 '17 at 17:10

2 Answers


As m-dz has commented, confusion_matrix expects 2 arrays, while in your code you pass the whole test dataframe.

Moreover, another common mistake is not respecting the order of the arguments, which matters.

All in all, you should ask for

confusion_matrix(test['Survived'], predictions)
desertnaut
  • Thanks. OK I see the problem with passing in the entire test dataframe. But the test data doesn't have a 'Survived' column (I am trying to predict that) and thus I get the **KeyError: 'Survived'** error. What should I try now? – Mike Chan Nov 08 '17 at 20:52
  • Just had a tiny epiphany. The test data doesn't have a 'Survived' column...and now I see how it would be impossible to run a confusion matrix to assess my model against the test data, since it has nothing to compare my predictions to. How would I go about assessing my predictions now? – Mike Chan Nov 08 '17 at 21:05
  • @MikeChan You will either set aside a portion of your `train` data for validation, or try cross-validation. In any case 1) this is completely out of scope re your original question 2) SO is not the venue for this type of tutorials - there are several [walk-through kernels](https://www.kaggle.com/c/titanic/kernels) for the Titanic data @ Kaggle aimed especially for beginners. – desertnaut Nov 08 '17 at 21:44
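The validation approach described in the comment above can be sketched as follows. The DataFrame here is a made-up stand-in for the cleaned Titanic training data (the column names and values are illustrative, not the real feature set), since only `train.csv` carries the `Survived` labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy stand-in for the cleaned training data (columns assumed for
# illustration; the real cleaned data would have more features).
train = pd.DataFrame({
    'Pclass':   [1, 3, 2, 3, 1, 2, 3, 1],
    'Age':      [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    'Fare':     [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    'Survived': [0, 1, 1, 0, 1, 0, 1, 1],
})

X = train.drop('Survived', axis=1)
y = train['Survived']

# Hold out 25% of the labelled rows for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)

logmodel = LogisticRegression()
logmodel.fit(X_tr, y_tr)

val_predictions = logmodel.predict(X_val)

# y_true first, y_pred second -- the argument order matters.
print(confusion_matrix(y_val, val_predictions))
```

Because the held-out rows come from `train.csv`, their true labels are known, so the confusion matrix can be computed; the unlabelled `test.csv` is then only used for the final Kaggle submission.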

Presumably your test consists of booleans (lived or died), while your predictions consist of floats (predicted probabilities of surviving). You should pick a threshold value and then generate booleans based on whether each predicted probability exceeds the threshold.
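For reference, in scikit-learn probabilities come from `predict_proba`, not `predict` (which already returns 0/1 labels). A minimal sketch of the thresholding step this answer describes, on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up two-feature data, purely illustrative.
X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 2.0],
              [3.0, 0.1], [0.5, 2.5], [2.5, 1.5]])
y = np.array([0, 0, 1, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# predict_proba returns probabilities; column 1 is P(class == 1).
proba = model.predict_proba(X)[:, 1]

# Apply a threshold to turn probabilities into 0/1 labels.
threshold = 0.5
labels = (proba > threshold).astype(int)
```

With the default 0.5 threshold this reproduces what `model.predict(X)` returns directly, which is why the thresholding step is not needed in the question's code: `predict` already outputs labels.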

Acccumulation
  • Please use "presumably" only if there are indeed gaps in the relevant info. This is not at all the case here, as [`predict`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) returns *labels*, not probabilities (the latter are returned by [`predict_proba`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba)) – desertnaut Nov 08 '17 at 17:40
  • @desertnaut There are indeed gaps in the relevant information. That `test.csv` contains a dataframe, not a series, while consistent with conventions, cannot be ascertained for certain without going to the kaggle site, and that `predict` returns labels can be ascertained only by looking at the documentation. As far as *information given in the question* is concerned, both the contents of `test` and `predictions` are unknown. – Acccumulation Nov 08 '17 at 23:06
  • Let me try to offer a friendly piece of advice, as someone slightly more experienced here at SO: 1) information in the docs is always both *relevant & available* 2) before you start presuming, ask for clarifications in the comments 3) to keep on arguing over an evidently wrong (and arguably speculative) answer based on mistaken assumptions, w/o any concrete code suggestions, that has already started attracting downvotes is unproductive. We all make mistakes - remedy what can be remedied, dump what cannot, and carry on... – desertnaut Nov 09 '17 at 07:20
  • I am not disputing that my answer is incorrect, merely that your objections to it are not entirely valid. And frankly, if someone posts a question on SO without bothering to state what their input is, they should expect that there's a good chance there will be problems. "w/o any concrete code suggestions" - I gave clear instructions on how to address what I thought the problem was. – Acccumulation Nov 09 '17 at 15:07
  • Downvoting continues, while you are still missing the point... :( – desertnaut Nov 09 '17 at 15:10