I am learning data science by reading other people's scripts. One Kaggle Titanic notebook has code that applies Logistic Regression and is then supposed to export the predictions to a .csv file. However, I get an error message every time I run it. The original script is found here, and the .csv data that's being read into the code is here: train.csv test.csv

Input[24] through Input[28] set up LogisticRegression. The code runs without error up to Input[27], but when I run Input[28]:

    acc_log = predict_model(X_data, Y_data, logreg, X_test_kaggle, 'submission_Logistic.csv')

I receive an error message:

    ValueError: could not convert string to float: 'Q'

I tried adding try/except to bypass the error so the code could continue:

    try:
        acc_log = predict_model(X_data, Y_data, logreg, X_test_kaggle, 'submission_Logistic.csv')
    except ValueError:
        pass

This code is a bit too sophisticated for me to debug: I can't tell which step goes wrong, or where in the data a string appears in place of the expected float. I would like help understanding the error and finding a proper solution. Thanks.

anicehat
  • Please post a snippet of both the data set and the predict_model() function, without those, it's nearly impossible to tell. But clearly you're passing a string into a function that is expecting a float. – elPastor Apr 29 '17 at 02:57
  • Hi @pshep123 thank you for your input. I have edited the question and added in references. I am not sure which section of the code and data set I should copy and paste, so I listed the references here instead. – anicehat Apr 29 '17 at 06:46

1 Answer

It looks like you didn't run cell 16 in the notebook link you provided, in which Embarked values are converted to integers (including the string value Q, which is throwing the error you're seeing):

Cell 16

    # fill the missing values of the Embarked feature with the most common occurrence
    freq_port = train_df.Embarked.dropna().mode()[0]
    for dataset in combine:
        dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
    train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

    # convert the port letters to integers so the model receives numeric input
    for dataset in combine:
        dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

    train_df.head()

I just ran all the cells in order and the LogisticRegression section worked fine for me. Try shutting down your notebook and re-running all the cells in the order they appear.
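To illustrate why skipping that cell produces this particular error, here is a minimal sketch on a toy frame. The data and the helper `to_float_matrix` are made up for illustration and are not from the notebook; the helper only approximates the numeric conversion that an estimator applies to its input.

```python
import pandas as pd

# Toy stand-in for the Kaggle data (hypothetical values, not the real files).
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Embarked": pd.Series(["S", "C", "Q"], dtype=object),
})


def to_float_matrix(df):
    """Hypothetical helper: convert a frame to a float array,
    roughly as an estimator does before fitting or predicting."""
    return df.to_numpy(dtype=float)


# Before the mapping: the string column cannot be converted.
try:
    to_float_matrix(toy)
    failed = False
except ValueError as err:
    failed = True
    print("conversion failed:", err)

# After the mapping from cell 16: every column is numeric.
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)
converted = to_float_matrix(toy)
print(converted)
```

Wrapping the call in try/except with `pass`, as in the question, only hides this failure: the string column is still there, and `acc_log` is never assigned.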

A general data science tip:
When you've already trained a model but your predict() function is throwing an error, it's helpful to look at the test data you're inputting and try to figure out what's wrong there.
In this case, searching the values in X_test_kaggle for the string Q might have revealed the problem was with the Embarked field, and that could have served as a first breadcrumb in tracking the problem back to its source.
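That search can be sketched as follows. The frame below is a hypothetical stand-in for X_test_kaggle with made-up values; in the notebook you would run the last two statements against the real variable.

```python
import pandas as pd

# Hypothetical stand-in for X_test_kaggle (made-up values).
X_test_kaggle = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, 26.0],
    "Embarked": pd.Series(["S", "C", "Q"], dtype=object),
})

# Columns whose dtype is not numeric are the first suspects.
non_numeric = X_test_kaggle.select_dtypes(exclude="number").columns.tolist()
print("non-numeric columns:", non_numeric)

# Columns that contain the literal value named in the error message.
has_q = [
    col for col in X_test_kaggle.columns
    if (X_test_kaggle[col] == "Q").any()
]
print("columns containing 'Q':", has_q)
```

Checking dtypes first is usually the quicker route, since any column the model cannot convert to a number will show up there regardless of which value the error happened to report.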

andrew_reece
  • Thanks @andrew_reece! I checked the code and my input was missing a line from this section. I typed things in by hand, so that line slipped past without my noticing. Good catch, and thanks for sharing your experience with me too :) – anicehat Apr 29 '17 at 19:17