
I'm trying to solve Kaggle's Titanic problem with Python, but I get an error when I try to fit my data. This is my code:

import pandas as pd
from sklearn import linear_model

def clean_data(data):
    data["Fare"] = data["Fare"].fillna(data["Fare"].dropna().median())
    data["Age"] = data["Age"].fillna(data["Age"].dropna().median())

    data.loc[data["Sex"] == "male", "Sex"] = 0
    data.loc[data["Sex"] == "female", "Sex"] = 1

    data.loc["Embarked"] = data["Embarked"].fillna("S")
    data.loc[data["Embarked"] == "S", "Embarked"] = 0
    data.loc[data["Embarked"] == "C", "Embarked"] = 1
    data.loc[data["Embarked"] == "Q", "Embarked"] = 2

train = pd.read_csv("train.csv")

clean_data(train)

target = train["Survived"].values
features = train[["Pclass", "Age","Sex","SibSp", "Parch"]].values

classifier = linear_model.LogisticRegression()
classifier_ = classifier.fit(features, target) # This is where the error comes from

And the error is this:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Can you help me please?

  • Well, what `NaN` data are in your table? It appears that you didn't preprocess the input as required by the `fit` function. – Prune Oct 03 '18 at 18:30
  • You can use `train.dropna(inplace=True)` to drop the NaNs in your dataframe. For more details, see: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html – Ankita Mehta Oct 03 '18 at 22:51

3 Answers

0

Before you fit the model with features and target, it is good practice to check whether any of the features you want to use contain null values. You can use the following to check:

`dataframe_name.isnull().any()` gives the column names, with True for each column that has at least one NaN value.

`dataframe_name.isnull().sum()` gives the column names, with the number of NaN values in each column.

Once you know which columns are affected, you can clean them, and the NaN error should no longer occur.
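
As a sketch of how these checks play out on the code in the question: the small frame below is made up (it stands in for `train.csv`, which is not reproduced here), and the names `demo`, `buggy` and `fixed` are hypothetical. It suggests that the line `data.loc["Embarked"] = ...` is a likely source of the NaN values, because `.loc` with a single label addresses a row, so pandas appends a new row labeled "Embarked" rather than filling the column.

```python
import pandas as pd

# Made-up stand-in for train.csv: a few rows with some of the same column names.
demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, None, 35.0],
    "Embarked": ["S", "C", None, "Q"],
})

# Same cleaning step as in the question for Age.
demo["Age"] = demo["Age"].fillna(demo["Age"].median())

# The question's line: .loc["Embarked"] targets a ROW label, not the column.
buggy = demo.copy()
buggy.loc["Embarked"] = buggy["Embarked"].fillna("S")

print(buggy.isnull().any())   # True for each column that contains NaN
print(buggy.isnull().sum())   # NaN count per column
print(len(demo), len(buggy))  # the buggy frame has gained a row

# Assumed fix: assign to the column instead of using .loc with a row label.
fixed = demo.copy()
fixed["Embarked"] = fixed["Embarked"].fillna("S")
nan_counts = fixed.isnull().sum()
print(nan_counts)
```

If the checks report NaN in columns that were already filled, looking at the last rows of the frame (for example with `train.tail()`) is a quick way to spot an accidentally appended row.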

0

You should reset the index of your dataframe before running any sklearn code:

df = df.reset_index()

D. Wei
0

NaN simply represents empty, None, or null values in a dataset. Before applying an ML algorithm to a dataset, you first need to preprocess it so that it can be handled cleanly. This step is called data cleaning. You can use scikit-learn's imputer to handle NaN.

How to check whether a dataset has NaN:
A Series' `isnull()` returns a True/False value for each element, showing whether that element is NaN.
For example:

  import numpy as np
  import pandas as pd
  s = pd.Series(['a', 'b', np.nan, 'c', np.nan])
  s.isnull()
  out: False, False, True, False, True

`s.isnull().sum()` returns the count of null values present in the series, in this case 2. You can apply the same method to a dataframe, e.g. `df.isnull()`.

Two techniques I know of to handle NaN:
1. Removing the rows that contain NaN,
e.g. `s.dropna()`, `s.dropna(inplace=True)`, or `df.dropna(how='all')` (which drops only the rows where every value is NaN).
But this can remove a lot of valuable information from the dataset, so it is usually avoided.
2. Imputing: replacing the NaN values with the mean/median of the column.

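  import numpy as np
  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
  from sklearn.impute import SimpleImputer
  imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # works column by column
  # strategy can also be 'median' or 'most_frequent'
  # training_data_df must contain only numeric columns for 'mean' or 'median'
  imputed_data = imputer.fit_transform(training_data_df.values)  # fit and transform in one step
  print(imputed_data)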
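
To make the trade-off between the two techniques concrete, here is a minimal sketch on a made-up frame (the values and the name `toy` are illustrative, not taken from the Titanic data). It assumes a scikit-learn version where `SimpleImputer` lives in `sklearn.impute`, and it wraps the imputer and `LogisticRegression` in a pipeline as one possible arrangement, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up data: two numeric features with gaps, plus a binary target.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, np.nan, 53.10, 8.05, 51.86],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Technique 1: drop rows. how="any" (the default) drops every row with a NaN;
# how="all" drops only rows where every value is NaN.
dropped_any = toy.dropna()
dropped_all = toy.dropna(how="all")
print(len(toy), len(dropped_any), len(dropped_all))

# Technique 2: impute inside a pipeline, so the column medians are learned
# in fit() and reused whenever the model predicts.
X = toy[["Age", "Fare"]].values
y = toy["Survived"].values
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
model.fit(X, y)  # NaNs are filled before the classifier sees the data
predictions = model.predict(X)
print(predictions)
```

On this toy frame, dropping rows discards half of the data, while imputation keeps every row available for fitting.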

I hope this helps.

log0