I have a dataframe with a column grade that contains categorical values. My problem stems from the fact that the values are of type float rather than object.

import pandas as pd
import numpy as np

df = pd.DataFrame(
 {
 "key": ["K0", "K1", "K2", "K3", "K4"],
 "grade": [1.0, 2.0, 2.0, np.nan, 3.0],
 }
)

The dataframe looks like this:
   key  grade
0   K0  1.0
1   K1  2.0
2   K2  2.0
3   K3  NaN
4   K4  3.0

I have missing values in the grade column. I want to impute them with the most frequent value using feature-engine, which is based on sklearn. Feature-engine includes widely used missing data imputation methods, such as mean and median imputation, frequent category imputation, and random sample imputation.

Install and load library:

! pip install feature-engine

from feature_engine.imputation import CategoricalImputer

Apply imputer:

# set up the imputer
imputer = CategoricalImputer(variables=['grade'], imputation_method='frequent')

# fit the imputer
imputer.fit(df)

# transform the data
df = imputer.transform(df)

df.head()

I get the following TypeError:

TypeError: Some of the variables are not categorical. Please cast them as object before calling this transformer

I understand the error but I don't understand why it appears. According to the docs, feature-engine can handle numerical variables with this transformer.

My questions are:

  1. How can I fix this by using the same transformer? Did I misunderstand the docs?
  2. If this transformer doesn't work, what other solutions do you suggest?
PParker

2 Answers

Just change the dtype of the grade column to object before using the imputer:

import numpy as np
import pandas as pd
from feature_engine.imputation import CategoricalImputer

df = pd.DataFrame(
 {
 "key": ["K0", "K1", "K2", "K3", "K4"],
 "grade": [1.0, 2.0, 2.0, np.nan, 3.0],
 }
)

# cast grade to object so the imputer treats it as categorical
df["grade"] = df["grade"].astype("object")

imputer = CategoricalImputer(variables=['grade'], imputation_method='frequent')
imputer.fit(df)
df = imputer.transform(df)

df.head()

    key  grade
0   K0   1.0
1   K1   2.0
2   K2   2.0
3   K3   2.0
4   K4   3.0

If you prefer the dtype of grade to be string/object after imputing, use:

imputer = CategoricalImputer(variables=['grade'],
                             imputation_method='frequent',
                             return_object=True)

# this returns

    key  grade
0   K0   1
1   K1   2
2   K2   2
3   K3   2
4   K4   3 
Abhi

The CategoricalImputer is intended to impute categorical variables only. That is why, by default, it works only on variables of type object or categorical.

However, there are cases where we want to treat variables with numerical values as categorical. In older versions of the package, we had to cast the variable to object to do so, as described by Abhi.

As of version 1.1, you can impute numerical variables with the CategoricalImputer straightaway by setting the parameter ignore_format=True within the transformer.

Sole Galli