Treat missing data as just another category

Asked May 31 '16 at 19:55

Active Jun 03 '16 at 16:16

Viewed 307 times

I have some data, mostly user demographics. There are a lot of survey questions that people have answered "yes" or "no", but the data naturally contains many missing values. I don't want to impute the missing values; I want to treat them as a third category, so that each question has three possible answers: "Yes", "No" and "NotSure".

What I have been doing so far is:

model = graphlab.boosted_trees_classifier.create(
    train, target=target, validation_set=None,
    max_iterations=80, verbose=False)

where target is what I am predicting (it is binary: 1 or -1). Both my train and test datasets have a lot of missing values, so until now I have been predicting with:

predictions = model.predict(test, missing_value_action='impute')

But these predictions do not give me good accuracy. I want to convert each two-category answer (Yes/No) into a three-category one (Yes/No/NotSure). How do I go about doing that?

I tried:

colNames = train.column_names()
for i in colNames[6:]:
    train.fillna(i,'NotSure')

This executes without any error but it doesn't work.

edited Jun 03 '16 at 16:16

Karup

Sorry are you asking how to do `df.fillna('NotSure')`? – EdChum May 31 '16 at 19:56
@EdChum Something like that but for multiple columns and graphlab's syntax for doing that. – Karup May 31 '16 at 20:26
You have to assign back to the column or pass param `inplace=True` – EdChum May 31 '16 at 20:28
but it doesn't support the `inplace` argument :( – Karup May 31 '16 at 20:29
Oh thanks! Got it `train=train.fillna(i,'NotSure')` :) – Karup May 31 '16 at 20:29
Semantically it should be `train[i] = train[i].fillna('NotSure')`; also I think `train[colNames] = train[colNames].fillna('NotSure')` should work – EdChum May 31 '16 at 20:31
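
The fix found in the comments comes down to one point: the fill operation returns a new object instead of modifying `train` in place, so the result has to be bound back to the name on every pass of the loop. A minimal, library-agnostic sketch of the same pattern in plain Python follows. It does not use GraphLab; the helper `fill_missing`, the sample rows, and the column names `q1` and `q2` are all hypothetical and only illustrate the idea.

```python
# Library-agnostic sketch (not GraphLab): rows are plain dicts and
# None marks a missing answer. All names here are hypothetical.
MISSING_LABEL = "NotSure"


def fill_missing(rows, columns, label=MISSING_LABEL):
    """Return NEW rows with None (or an absent key) replaced by `label`
    in the given columns. The input rows are left untouched."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the original row is not mutated
        for col in columns:
            if new_row.get(col) is None:
                new_row[col] = label
        filled.append(new_row)
    return filled


train = [
    {"user_id": 1, "q1": "Yes", "q2": None},
    {"user_id": 2, "q1": None, "q2": "No"},
]
survey_columns = ["q1", "q2"]

# Calling fill_missing(train, survey_columns) without assigning the result
# would leave `train` unchanged, which mirrors the symptom in the question.
train = fill_missing(train, survey_columns)
```

After the reassignment, every survey column holds one of three labels ("Yes", "No", "NotSure"), so a tree-based classifier can treat the missing answers as their own category.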

0 Answers