Scikit-learn: how to change a categorical value with missing data to a numerical one

Question

I am using sklearn for a machine learning project, and one of the columns is in categorical form. I would like to convert it into numerical form with an ordinal encoder, and then impute the missing data. Sklearn's OrdinalEncoder throws an error:

ValueError: Input contains NaN

but I would really rather not use the categorical imputer first and then convert the values into numbers, because it is much less suited to the nature of the data. Is there any way around this?

Here is the code:

from sklearn.preprocessing import OrdinalEncoder
ordinalenc = OrdinalEncoder()
imd = ordinalenc.fit_transform(info[["imd_band"]])
print(ordinalenc.categories_)

score 0 · Answer 1 · answered Apr 04 '20 at 17:01

Documented inline

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'x': ['a', 'b', 'b', np.nan] * 3})
ordinalenc = OrdinalEncoder()
# Categorical to ordinal, for the non-NaN rows only.
# Double brackets keep the input 2D; ravel() flattens the (n, 1) result.
mask = df['x'].notnull()
df.loc[mask, 'new_x'] = ordinalenc.fit_transform(df.loc[mask, ['x']]).ravel()
# Now impute the remaining NaN values with the most frequent code
im = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df['new_x'] = im.fit_transform(df[['new_x']]).ravel()
print(df)

Output

    x   new_x
0   a   0.0
1   b   1.0
2   b   1.0
3   NaN 1.0
4   a   0.0
5   b   1.0
6   b   1.0
7   NaN 1.0
8   a   0.0
9   b   1.0
10  b   1.0
11  NaN 1.0

answered Apr 04 '20 at 17:01

mujjiga

Unfortunately I'm new at this and I'm not sure I understand what the equivalent of x in my dataframe is. I'm using the Open University dataset https://analyse.kmi.open.ac.uk/open_dataset so I have not created my own dataframes. I understand from the documentation that .loc accesses a group of columns by label. I have the impression that labels are not the same thing as column names, and certainly none of the column names worked when I put them in place of x. Is there a way to select this data by column and not by labels? – plotka Apr 04 '20 at 21:12
i.e. this line: ```info.loc[info['imd_band'].notnull(), 'new_x'] = ordinalenc.fit_transform(info[info['imd_band'].notnull()])``` produces a ```ValueError: Must have equal len keys and value when setting with an ndarray``` – plotka Apr 04 '20 at 21:20
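
The ValueError in the comment above most likely comes from passing the whole `info` frame to the encoder: every column gets encoded, so the result has several columns and cannot be assigned to the single `new_x` column. Selecting only `imd_band` and flattening the result should avoid it. A minimal sketch, using a small stand-in for `info` because the real dataset is not loaded here; `other_col`, `imd_band_num` and all the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Stand-in for the real `info` frame (values and extra column are made up).
info = pd.DataFrame({
    "other_col": [1, 2, 3, 4, 5, 6],
    "imd_band": pd.Series(
        ["0-10%", "10-20%", np.nan, "10-20%", "20-30%", np.nan], dtype=object
    ),
})

mask = info["imd_band"].notnull()

# Pass ONLY the imd_band column (double brackets keep it 2D),
# then flatten the (n, 1) result so it lines up with the masked rows.
ordinalenc = OrdinalEncoder()
info.loc[mask, "imd_band_num"] = ordinalenc.fit_transform(
    info.loc[mask, ["imd_band"]]
).ravel()

# Impute the rows that were skipped above with the most frequent code.
im = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
info["imd_band_num"] = im.fit_transform(info[["imd_band_num"]]).ravel()

print(info)
```

In pandas, column names are the column labels, so `info["imd_band"]` and `info.loc[rows, "imd_band"]` both select by column name.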