I am working through the Titanic competition. This is my code so far:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/test.csv")

train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])

# Fill missing values in Age feature with each sex’s median value of Age
train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)

linReg = LinearRegression()

data = train[['Pclass', 'Sex', 'Parch', 'Fare', 'Age']]

# implement train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, train['Survived'], test_size=0.2, random_state=0)

# Training the machine learning algorithm
linReg.fit(x_train, y_train)

# Checking the accuracy score of the model
accuracy = linReg.score(x_test, y_test)
print(accuracy*100, '%')

The data line previously looked like this: data = train[['Pclass', 'Parch', 'Fare', 'Age']], which gave me an accuracy score of 19.5%. I realized that I hadn't included Sex, so I changed it to this:

data = train[['Pclass', 'Sex', 'Parch', 'Fare', 'Age']]

Then, I got the following error:

ValueError: could not convert string to float: 'female'

Here I realized that the changes I made to train['Sex'] and train['Age'] were not reflected in the data used to train and test the model, which seems to be the reason my model scored only 19.5%. How do I overcome this problem?

UPDATE

After the first answer, I tried to modify this line accordingly:

train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)

to this:

train['Age'] = train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)

I then printed the Age column, and it turns out that the values are corrupted:

0      None
1      None
2      None
3      None
4      None
5      None
6      None
7      None
8      None
9      None
10     None
11     None
12     None
13     None
14     None
15     None
16     None
17     None
18     None
19     None
20     None
21     None
22     None
23     None
24     None
25     None
26     None
27     None
28     None
29     None
       ... 
861    None
862    None
863    None
864    None
865    None
866    None
867    None
868    None
869    None
870    None
871    None
872    None
873    None
874    None
875    None
876    None
877    None
878    None
879    None
880    None
881    None
882    None
883    None
884    None
885    None
886    None
887    None
888    None
889    None
890    None
Name: Age, Length: 891, dtype: object
Thibault Bacqueyrisses

2 Answers

That's because this line does not save the modification back to your DataFrame. replace returns a new Series and leaves train['Sex'] unchanged:

train['Sex'].replace(['female', 'male'], [0, 1])

Try replacing it with this:

train['Sex'] = train['Sex'].replace(['female', 'male'], [0, 1])

Same for train['Embarked'].
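As a minimal sketch of the same idea, the snippet below encodes both columns and assigns the result back. The three-row frame is a hypothetical stand-in for the Titanic data, and it uses Series.map with a dictionary instead of replace, which is a swapped-in alternative rather than the call from the question. Note that map turns any value missing from the dictionary into NaN.

```python
import pandas as pd

# Hypothetical miniature of the Titanic frame, for illustration only.
train = pd.DataFrame({
    "Sex": ["female", "male", "female"],
    "Embarked": ["S", "C", "Q"],
})

# map returns a new Series; on its own it does not change the frame.
encoded_sex = train["Sex"].map({"female": 0, "male": 1})
print(train["Sex"].tolist())  # still the original strings

# Assigning the result back is what actually updates the column.
train["Sex"] = encoded_sex
train["Embarked"] = train["Embarked"].map({"C": 1, "Q": 2, "S": 3})

print(train)
```

After the assignments, both columns hold integers, so they can be passed to the model.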

Update

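You don't need to do it for train['Age']: fillna already modifies the existing DataFrame because of inplace=True. With inplace=True the call returned None, so assigning its result back to train['Age'] is what overwrote the column with None. Either keep inplace=True without the assignment, or keep the assignment and drop inplace=True, but not both.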
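A minimal sketch of the assignment form, without inplace, might look like this. The six-row frame and the name sex_median_age are hypothetical and chosen only so the medians are easy to check by hand.

```python
import pandas as pd

# Hypothetical miniature of the Titanic frame, for illustration only.
train = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Age": [20.0, 30.0, None, 50.0, 40.0, None],
})

# Median age per sex, aligned row by row with the original frame.
sex_median_age = train.groupby("Sex")["Age"].transform("median")

# No inplace here: fillna returns the filled Series, which is assigned back.
train["Age"] = train["Age"].fillna(sex_median_age)

print(train["Age"].tolist())
```

The female median of 20 and 40 is 30, and the male median of 30 and 50 is 40, so those are the values that fill the two gaps.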

Thibault Bacqueyrisses
  • Could you check the edit. My age data doesn't exist. – Andros Adrianopolos Jun 05 '19 at 08:58
  • Excuse me, I was a little confused in my response; I have updated it. – Thibault Bacqueyrisses Jun 05 '19 at 09:06
  • Thank you. I've accepted and upvoted your answer. If you think that this was a well asked question, could you upvote me as well? – Andros Adrianopolos Jun 05 '19 at 09:28
  • Glad to have helped you! Yes, sure thing ;) – Thibault Bacqueyrisses Jun 05 '19 at 09:35
  • @Andros Adrianopolos Taking the "upvote me" comment as a request for feedback, I am downvoting the OP: not a [minimal](https://stackoverflow.com/help/minimal-reproducible-example) example and complains about an exception without providing the stack trace. – Leporello Jun 05 '19 at 09:37
  • Actually, I was able to reproduce the example and the error easily, so I don't think this is fair. – Thibault Bacqueyrisses Jun 05 '19 at 09:41
  • @Leporello What utter nonsense. The code is already small so how minimal did you want it to be? Also, I've provided what the error was so why do you still need pages of stacktrace? Didn't you just complain about not having "minimal" code? – Andros Adrianopolos Jun 05 '19 at 09:43
  • The code is small, but not minimal: for instance, lines related to `train['Embarked']`, and `train['Age']` could have been scrapped when investigating your initial issue. As for the stack trace: the line at which the error occurs is at least as important as the error itself; scrapping that information makes the description _incomplete_, not _minimal_ (though I agree the error can be easily reproduced so it's not a big deal _in that exact case_). – Leporello Jun 05 '19 at 09:52

You just need to modify two lines:

train['Sex'].replace(['female', 'male'], [0, 1], inplace=True)
train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3], inplace=True)

Then it will work.
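Whichever variant is used, it may help to confirm that every feature column is numeric before calling fit, since the ValueError in the question comes from a column that still holds strings. The sketch below assumes a hypothetical helper named non_numeric_columns and a small made-up frame; it relies only on pandas.api.types.is_numeric_dtype.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_columns(frame, columns):
    """Return the names in `columns` whose dtype is not numeric."""
    return [name for name in columns if not is_numeric_dtype(frame[name])]


# Hypothetical miniature of the Titanic frame, for illustration only.
train = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.28, 7.92],
})
features = ["Pclass", "Sex", "Fare"]

# Before encoding, Sex is still text and would break the model fit.
before = non_numeric_columns(train, features)
print(before)

# After encoding and assigning back, nothing is flagged.
train["Sex"] = train["Sex"].map({"female": 0, "male": 1})
after = non_numeric_columns(train, features)
print(after)
```

An empty list from the check means the selected columns can be converted to floats by the model.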