I am working on the Titanic dataset. When I try to apply OneHotEncoder to one of the columns, 'Embarked', which has 3 possible values ('S', 'Q' and 'C'), I get:
ValueError: Input contains NaN
I checked the contents of the column in two ways: first with a for-loop over value_counts, and second by writing the entire isna() table to a CSV file:
for col in X.columns:
    print(col)
    print(X[col].value_counts(dropna=False))
X.isna().to_csv("xisna.csv")
print("notna================== :",X.notna().shape)
X.dropna(axis=0,how='any',inplace=True)
print("X.shape " ,X.shape)
return pd.DataFrame(X)
For 'Embarked' this printed:
Embarked
S 518
C 139
Q 55
Name: Embarked, dtype: int64
I also read through the more than 700 rows of the CSV and did not find a single 'True' entry.
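Reading the CSV by eye is error-prone, so a more direct check might look like the sketch below. It counts missing values per column and lists the affected rows. The small DataFrame `X_demo` is made-up data for illustration, not the real Titanic set, and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Made-up sample data (not the real Titanic set) with one missing 'Embarked' value.
X_demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Number of missing values in each column.
nan_counts = X_demo.isna().sum()
print(nan_counts)

# Rows that contain at least one missing value.
rows_with_nan = X_demo[X_demo.isna().any(axis=1)]
print(rows_with_nan)
```

Running the same two expressions on the frame that is actually passed to the pipeline (rather than on the frame inside the fill step) should show whether any NaN is still present at that point.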
This is the pipeline; it fails at the ("cat",One...) step:
cat_attribs=["Sex","Embarked"]
special_attribs = {'drop_attribs' : ["Name","Cabin","Ticket","PassengerId"], 'k' : [3]}
full_pipeline = ColumnTransformer([
    ("fill",fill_pipeline,list(strat_train_set)),
    ("emb_cat",OneHotEncoder(),['Sex']),
    ("cat",OneHotEncoder(),['Embarked']),
])
So where exactly is the NaN value that I am missing?
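To narrow this down, here is a minimal sketch that checks whether a later entry in a ColumnTransformer sees the output of an earlier one or the raw column. It runs on made-up data. `SimpleImputer` stands in for my `fill_pipeline` (an assumption, since the real one is not shown), and `probe`, `seen` and `demo_pipeline` are hypothetical names used only for this test.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

# Made-up data with one missing 'Embarked' value.
X_demo = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q", "S"]})

seen = {}

def probe(X):
    # Record how many NaN values reach this branch, then return a dummy column.
    seen["nan_count"] = int(pd.DataFrame(X).isna().sum().sum())
    return np.zeros((len(X), 1))

demo_pipeline = ColumnTransformer([
    # Stand-in for fill_pipeline: imputes the most frequent value.
    ("fill", SimpleImputer(strategy="most_frequent"), ["Embarked"]),
    # Second branch on the same column, listed after the imputer.
    ("probe", FunctionTransformer(probe), ["Embarked"]),
])

out = demo_pipeline.fit_transform(X_demo)
print(seen["nan_count"])
```

In this sketch the probe still sees the missing value even though the "fill" entry imputes the same column, because each entry of a ColumnTransformer is given the original input columns rather than the output of the entry listed before it. If my real pipeline behaves the same way, the "cat" encoder would be receiving the un-imputed 'Embarked' column, but I have not confirmed that this is the cause.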