
I am working on the Titanic dataset. When I try to apply one-hot encoding to the column 'Embarked', which has the 3 possible values 'S', 'Q', and 'C', I get the

ValueError: Input contains NaN

I checked the contents of the column using 2 methods: the first is a for-loop with value_counts, and the second writes the entire table to a CSV:

for col in X.columns:
    print(col)
    print(X[col].value_counts(dropna=False))  # count values, including NaN
X.isna().to_csv("xisna.csv")                  # True wherever a cell is NaN
print("notna================== :", X.notna().shape)
X.dropna(axis=0, how='any', inplace=True)     # drop any row containing a NaN
print("X.shape ", X.shape)
return pd.DataFrame(X)

Which yielded

Embarked
S    518
C    139
Q     55
Name: Embarked, dtype: int64

I checked the contents of the CSV and, reading through the 700-odd entries, I did not find a single 'True' entry.
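As an aside, missing values can be counted and located in a couple of lines instead of scanning a CSV by eye. A minimal sketch on a toy frame (the rows here are made up, not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training set (hypothetical rows).
X = pd.DataFrame({
    "PassengerId": [61, 62, 830, 831],
    "Embarked": ["S", np.nan, np.nan, "C"],
})

# Per-column NaN counts in one line.
print(X.isna().sum())

# Locate the offending rows directly.
missing = X[X["Embarked"].isna()]
print(missing["PassengerId"].tolist())  # [62, 830]
```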

The pipeline blocks at the ("cat", OneHotEncoder(), ...) step:

cat_attribs=["Sex","Embarked"]
special_attribs = {'drop_attribs' : ["Name","Cabin","Ticket","PassengerId"], 'k' : [3]}

full_pipeline = ColumnTransformer([
    ("fill",fill_pipeline,list(strat_train_set)),
    ("emb_cat",OneHotEncoder(),['Sex']),
    ("cat",OneHotEncoder(),['Embarked']),
])

So where exactly is the NaN-value that I am missing?

BURNS
  • Which titanic dataset are you using? I tried using `X['Embarked'].value_counts(dropna=False)` on the `train.csv` dataset downloaded from the [kaggle competition](https://www.kaggle.com/c/titanic/data), and I obtained this result: `S 644 C 168 Q 77 NaN 2 Name: Embarked, dtype: int64`. There are indeed two `NaN` values at the `PassengerId` 62 and 830 – Ric S Jul 14 '20 at 06:58
  • I used the one directly from Kaggle. I have split the dataset into a training_set and test_set. And applied another function to fill in the NaN's. The summary you see in my post is solely from my training set, which is also the only set that I try to transform. – BURNS Jul 14 '20 at 07:02
  • Have you tried this only on your train set? Is it possible that the `NaN` values are only in the test set? – Ric S Jul 14 '20 at 07:06
  • Yes, it is only on the training set, because the number of records is 0.8 times that of the original set and the csv contains that number of records. And I can see from my own print statements that, during the transformation of the training set, the NaN 'Embarked' cells were found and transformed to non-NaN values – BURNS Jul 14 '20 at 07:20

1 Answer


I figured it out: a ColumnTransformer concatenates the outputs of its transformers instead of passing them along to the next transformer in line. So any transformations done in fill_pipeline won't be seen by the OneHotEncoder, since it is still working on the untransformed dataset. I therefore had to put the one-hot encoding into fill_pipeline instead of into the ColumnTransformer.

full_pipeline = ColumnTransformer([
    ("fill",fill_pipeline,list(strat_train_set)),
    ("emb_cat",OneHotEncoder(),['Sex']),
    ("cat",OneHotEncoder(),['Embarked']),
])
BURNS
  • Sorry but code in your answer is same as that in question, didn't you mean that you made separate pipeline for OHE? – Deep Jul 27 '21 at 07:20