
I have a data set on police killings that you can find on Kaggle. There's some missing data in several columns (fraction of missing values per column):

UID                0.000000
Name               0.000000
Age                0.018653
Gender             0.000640
Race               0.317429
Date               0.000000
City               0.000320
State              0.000000
Manner_of_death    0.000000
Armed              0.454487
Mental_illness     0.000000
Flee               0.000000
dtype: float64

I created a copy of the original dataframe (pf), called lepf, to encode it and then impute missing values. My plan was:

  1. Label encode all categorical columns:
Index(['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed',
       'Mental_illness', 'Flee'],
      dtype='object')
le = LabelEncoder()
lpf = {}
for col in lepf.columns:    
    lpf[col] = le.fit_transform(lepf[col])
lpfdf = pd.DataFrame(lpf)

Now I have my dataframe with all categories encoded.

  2. Then, I located the NaN values in the original dataframe (pf), in order to replace the corresponding encoded labels in lpfdf with NaN:
for col in lpfdf:
    print(col,"\n",len(np.where(pf[col].to_frame().isna())[0]))

Gender 8
Race 3965
City 4
State 0
Manner_of_death 0
Armed 5677
Mental_illness 0
Flee 0

For instance, Gender got three encoded labels: 0 for Male, 1 for Female, and 2 for NaN. However, the feature City had more than 3,000 distinct values, so it was not possible to spot the NaN label using value_counts(). For that reason, I used:

np.where(pf["City"].to_frame().isna())

Which yielded:

(array([ 4110, 9093, 10355, 10549], dtype=int64), array([0, 0, 0, 0], dtype=int64))

Looking at any of the rows corresponding to these indices, I saw that the label encoding NaN for City was 3327:

lpfdf.iloc[10549]

Gender                1
Race                  6
City               3327
State                10
Manner_of_death       1
Armed                20
Mental_illness        0
Flee                  0
Name: 10549, dtype: int64

Then I proceeded to replace these labels with np.nan:

"""
Gender: 2,
Race: 6,
City: 3327,
Armed: 59

"""
lpfdf["Gender"] = lpfdf["Gender"].replace(2, np.nan)
lpfdf["Race"] = lpfdf["Race"].replace(6, np.nan)
lpfdf["City"] = lpfdf["City"].replace(3327, np.nan)
lpfdf["Armed"] = lpfdf["Armed"].replace(59, np.nan)

  3. Create an instance of IterativeImputer, then fit and transform lpfdf:
itimp = IterativeImputer()
iilpf = itimp.fit_transform(lpfdf)

Then I made a dataframe from these new imputed values:

itimplpf = pd.DataFrame(np.round(iilpf), columns = lepf.columns)

And finally, when I go to inverse-transform to see the corresponding labels it imputed, I get the following error:

for col in lpfdf:    
    le.inverse_transform(itimplpf[col].astype(int))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-191-fbdde4bb4781> in <module>
      1 for col in lpfdf:
----> 2     le.inverse_transform(itimplpf[col].astype(int))

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in inverse_transform(self, y)
    158         diff = np.setdiff1d(y, np.arange(len(self.classes_)))
    159         if len(diff):
--> 160             raise ValueError(
    161                     "y contains previously unseen labels: %s" % str(diff))
    162         y = np.asarray(y)

ValueError: y contains previously unseen labels: [2 3 4 5]

What is wrong with my steps? Sorry for the long-winded explanation, but I felt I needed to explain all the steps so that you can understand the issue properly. Thank you all.

  • Two comments that might help you: 1) Why are you using `LabelEncoder` and not `OneHotEncoder`? Consider it. 2) Are you labeling NaN values? That's probably not the best procedure. You should first deal with them, maybe using `SimpleImputer`, and then encode. – Alex Serra Marrugat Jul 29 '21 at 11:14
  • Thanks for your suggestions, Alex. First off, OneHotEncoder is harder for me to work with, especially in this case where I would end up with >3000 new columns. And as I've told afsharov, SimpleImputer does not meet my requirements, because the data would then not reflect reality at all: in Armed, for example, the most frequent label would absorb the remaining 45% of observations that were originally NaN. – Mario Aguilar Jul 29 '21 at 15:14
  • One suggestion for dealing with NaN values could be KNN imputation; maybe that technique fits your requirements better. About `LabelEncoder` (almost equal to `OrdinalEncoder`): it will encode your data with an "order", meaning it will assume that 2>1>0..., and for most categorical variables this is not true. Be careful, because the algorithm will assume this relation. One example use of `LabelEncoder` is encoding: Very bad, bad, good, very good --> 0,1,2,3. But I assume this is not your case. – Alex Serra Marrugat Jul 30 '21 at 06:13

3 Answers

1

Your approach of encoding categorical values first and then imputing missing values is prone to problems and thus not recommended.

Some imputing strategies, like IterativeImputer, will not guarantee that the output contains only previously known numeric values. This can result in imputed values which are unknown to the encoder and will cause an error upon the inverse transformation (which is exactly your case).

It is better to first impute the missing values for both numeric and categorical features, and then encode the categorical features. One option would be to use SimpleImputer and replace missing values with the most frequent category or a new constant value.
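For illustration, here is a minimal sketch of that order of operations, using the column names from the question (the fill value "Missing" is just an arbitrary new category):

from sklearn.impute import SimpleImputer

cat_cols = ['Gender', 'Race', 'City', 'State', 'Manner_of_death',
            'Armed', 'Mental_illness', 'Flee']

# impute first: missing categories become an explicit "Missing" category
imp = SimpleImputer(strategy='constant', fill_value='Missing')
pf_imputed = pf.copy()
pf_imputed[cat_cols] = imp.fit_transform(pf[cat_cols])

# only afterwards encode the (now complete) categorical columns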


Also, a note on LabelEncoder: it is clearly mentioned in its documentation that:

This transformer should be used to encode target values, i.e. y, and not the input X.

If you insist on an encoding strategy like LabelEncoder, you can use OrdinalEncoder which does the same but is actually meant for feature encoding. However, you should be aware that such an encoding strategy might falsely suggest an ordinal relationship between each category of a feature, which might lead to undesired consequences. You should therefore consider other encoding strategies as well.
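And if you do go with OrdinalEncoder despite that caveat, a sketch of encoding the already-imputed columns from the sketch above (unlike LabelEncoder, it accepts a 2-D feature matrix and can invert all columns in one call):

from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
encoded = enc.fit_transform(pf_imputed[cat_cols])   # shape (n_rows, n_categorical_cols)

# ... modelling / further imputation on `encoded` ...

# a single call recovers the original string categories for every column
decoded = enc.inverse_transform(encoded)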

afsharov
  • I know man, but using SimpleImputer to replace NaN with the most frequent category is not what I am looking for, since for features like Race (31%) or Armed (45%) it would bias the data too much. Regarding OrdinalEncoder, as you've said, I do not want sklearn to see the labels as following a certain order; that is why I chose LabelEncoder and looped through the columns, so that each one is passed as a Series, as if it were my target y. – Mario Aguilar Jul 29 '21 at 15:09
1

A possibility that might be worth exploring is predicting missing categorical (encoded) values using a machine learning algorithm, e.g. sklearn.ensemble.RandomForestClassifier.

Here, you would train a multiclass classification model for predicting missing values of each of your columns. You'd start by replacing missing values with a magic value (e.g -99), and then one-hot encode them. Next, train a classification model to predict the categorical value of a chosen column, using the one-hot encoded values of the other columns as training data. The training data would, of course, exclude rows where the column to be predicted is missing. Finally, compose a "test" set made from the rows where this column is missing, predict the values, and impute these values into the column. Repeat this for each column that needs to have missing values imputed.

Assuming you want to apply machine learning techniques to this data at a later point, a deeper question is whether the absence of values in some examples of the dataset may in fact carry useful information for predicting your Target, and consequently, whether a particular imputation strategy could corrupt that information.

Edit: Below is an example of what I mean, using dummy data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
#from catboost import CatBoostClassifier

# create some fake data
n_samples = 1000
n_features = 20
features_og, _ = make_classification(n_samples=n_samples, n_features=n_features,
                                      n_informative=3, n_repeated=16, n_redundant=0)

# convert to fake categorical data
features_og = (features_og * 10).astype(int)

# add missing value flag (-99) at random
features = features_og.copy()
for i in range(n_samples):
    for j in range(n_features):
        if np.random.random() > 0.85:
            features[i, j] = -99

# go column by column predicting and replacing missing values
features_fixed = features.copy()
for j in range(n_features):
    # do train/test split based on whether the selected column value is -99
    train = features[np.where(features[:, j] != -99)]
    test = features[np.where(features[:, j] == -99)]

    clf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=42)

    # potentially better for categorical features is CatBoost:
    #clf = CatBoostClassifier(n_estimators=300, cat_features=[identify categorical features here])

    # train the classifier to predict the value of column j using the other columns
    clf.fit(train[:, [x for x in range(n_features) if x != j]], train[:, j])

    # predict values for elements of column j that have the missing flag
    preds = clf.predict(test[:, [x for x in range(n_features) if x != j]])

    # substitute the missing values in column j with the predicted values
    features_fixed[np.where(features[:, j] == -99), j] = preds
  • Hey Andrew, thanks for your suggestion. If I understood correctly, the steps would be: 1) Replace np.nan with -99. 2) Split the dataframe into a train set (which doesn't contain -99) and a test set (which does). 3) Fit the OH encoder to the train set and transform it. 4) Fit the classifier model to the training set. 5) Use the fitted OH encoder to transform the test set. 6) Predict on that set with the trained classifier. I have been trying this procedure as shown but got stuck in the last step due to a mismatch between train and test sizes. – Mario Aguilar Jul 31 '21 at 18:30
  • Hi Mario, I've added a code example which I hope will help. – Andrew Humphrey Aug 01 '21 at 23:13
  • Man, it is awesome. Thank you very much! You don't know how much I appreciate this answer and how helpful and light-shedding it's been. – Mario Aguilar Aug 03 '21 at 08:20
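Regarding the size mismatch mentioned in the first comment above: a rough sketch of the usual pattern is to fit the OneHotEncoder on the training rows only and pass handle_unknown='ignore', so that categories that only appear in the rows with missing values do not change the encoded width (train, test, other_cols and target_col are placeholders here):

from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# train/test are the rows where the target column is present/missing;
# other_cols are the remaining feature columns, target_col the column to fill
ohe = OneHotEncoder(handle_unknown='ignore')
X_train = ohe.fit_transform(train[other_cols])   # fit on training rows only
X_test = ohe.transform(test[other_cols])         # same columns, same width

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, train[target_col])
preds = clf.predict(X_test)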
0

The entire process can be automated with the datawig package. You just need to create an imputation model for each to-be-imputed column, and it will handle the encoding and inverse encoding by itself.

It was even tested against kNN and iterative imputer and showed better results.

Here is a personal guide.
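A minimal sketch of that per-column workflow, assuming datawig's SimpleImputer interface (input_columns / output_column / output_path with fit and predict) and using the question's column names as an example:

import datawig

# one imputation model per column to be filled; datawig handles the string
# encoding and decoding internally
imputer = datawig.SimpleImputer(
    input_columns=['Gender', 'City', 'State', 'Manner_of_death', 'Flee'],  # predictor columns
    output_column='Race',                                                  # column to impute
    output_path='race_imputer_model'                                       # where the trained model is stored
)
imputer.fit(train_df=pf[pf['Race'].notna()])
predictions = imputer.predict(pf[pf['Race'].isna()])  # returns these rows with an added imputed-value column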