I have a data set on police killings that you can find on Kaggle. There's some missing data in several columns:
UID                0.000000
Name               0.000000
Age                0.018653
Gender             0.000640
Race               0.317429
Date               0.000000
City               0.000320
State              0.000000
Manner_of_death    0.000000
Armed              0.454487
Mental_illness     0.000000
Flee               0.000000
dtype: float64
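For reference, a table of missing fractions like the one above can be produced along these lines. This is a minimal sketch on a tiny stand-in frame; `pf_demo` and its values are made up for illustration and are not the Kaggle data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataframe (pf).
pf_demo = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Race": ["White", np.nan, np.nan, "Black"],
    "State": ["CA", "TX", "NY", "FL"],
})

# isna() gives a boolean frame; the mean of each column is its missing fraction.
missing_fraction = pf_demo.isna().mean()
print(missing_fraction)
```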
I created a copy of the original df to encode it and then impute missing values. My plan was:
- Label encode all categorical columns:
Index(['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed', 'Mental_illness', 'Flee'], dtype='object')
le = LabelEncoder()
lpf = {}
for col in lepf.columns:
    lpf[col] = le.fit_transform(lepf[col])
lpfdf = pd.DataFrame(lpf)
Now I have my dataframe with all categories encoded.
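One thing worth noting about this step: calling fit_transform on the same LabelEncoder for every column refits it each time, so after the loop it only holds the classes of the last column it saw. A possible variant, sketched here on made-up data, keeps one encoder per column so each column can later be decoded with its own classes. The names `demo`, `encoders`, `shared`, and `decoded` are hypothetical, and this is only a sketch under that assumption, not a confirmed fix for the full pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the categorical columns (lepf).
demo = pd.DataFrame({
    "Race": ["White", "Black", "Asian", "White"],
    "Flee": ["Not fleeing", "Car", "Not fleeing", "Car"],
})

# A single shared encoder is refit on every column, so it ends up
# remembering only the classes of the last column ("Flee" here).
shared = LabelEncoder()
for col in demo.columns:
    shared.fit_transform(demo[col])
print(list(shared.classes_))

# One encoder per column keeps every column's classes available.
encoders = {}
encoded = {}
for col in demo.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(demo[col])
    encoders[col] = enc
encoded = pd.DataFrame(encoded)

# Each column is decoded with the encoder that was fit on it.
decoded = pd.DataFrame({
    col: encoders[col].inverse_transform(encoded[col].astype(int))
    for col in encoded.columns
})
print(decoded)
```

With a dictionary of encoders, the label ranges of different columns no longer have to match the classes of whichever column happened to be fit last.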
- Then, I located the nan values in the original dataframe (pf), so that I could replace the corresponding encoded nan labels in lpfdf:
for col in lpfdf: print(col,"\n",len(np.where(pf[col].to_frame().isna())[0]))
Gender 8
Race 3965
City 4
State 0
Manner_of_death 0
Armed 5677
Mental_illness 0
Flee 0
For instance, Gender got three encoded labels: 0 for Male, 1 for Female, and 2 for nan. However, the feature City had more than 3,000 distinct values, so it was not practical to pick out its nan label by scanning value_counts(). For that reason, I used:
np.where(pf["City"].to_frame().isna())
Which yielded:
(array([ 4110, 9093, 10355, 10549], dtype=int64), array([0, 0, 0, 0], dtype=int64))
Looking at any of the rows at these indices, I saw that the nan label for City was 3327:
lpfdf.iloc[10549]
Gender 1
Race 6
City 3327
State 10
Manner_of_death 1
Armed 20
Mental_illness 0
Flee 0
Name: 10549, dtype: int64
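The same lookup can be done without inspecting rows by hand, by selecting the encoded values at the positions where the original column is missing. This is a minimal sketch on made-up data; `pf_demo` and `lpfdf_demo` are hypothetical stand-ins for pf and lpfdf, and the label 2 plays the role that 3327 plays for City above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the original (pf) and encoded (lpfdf) frames.
pf_demo = pd.DataFrame({"City": ["Austin", np.nan, "Boston", np.nan]})
lpfdf_demo = pd.DataFrame({"City": [0, 2, 1, 2]})

# Boolean mask of rows where the original value is missing.
nan_mask = pf_demo["City"].isna()

# Encoded values at those rows; a single unique value is the nan label.
nan_codes = lpfdf_demo.loc[nan_mask, "City"].unique()
print(nan_codes)
```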
Then I proceeded to replace these labels with np.nan:
"""
Gender: 2,
Race: 6,
City: 3327,
Armed: 59
"""
lpfdf["Gender"] = lpfdf["Gender"].replace(2, np.nan)
lpfdf["Race"] = lpfdf["Race"].replace(6, np.nan)
lpfdf["City"] = lpfdf["City"].replace(3327, np.nan)
lpfdf["Armed"] = lpfdf["Armed"].replace(59, np.nan)
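An alternative to hard-coding each nan label is to blank out the encoded frame wherever the original frame is missing. The sketch below assumes both frames share the same index and column names; `pf_demo`, `lpfdf_demo`, and `masked` are hypothetical names on made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the original (pf) and encoded (lpfdf) frames.
pf_demo = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female"],
    "City": ["Austin", "Boston", np.nan],
})
lpfdf_demo = pd.DataFrame({"Gender": [1, 2, 0], "City": [0, 1, 2]})

# mask() sets a cell to NaN wherever the condition is True; casting to
# float first lets the integer codes hold NaN.
masked = lpfdf_demo.astype(float).mask(pf_demo.isna())
print(masked)
```

This avoids having to look up the nan label per column, at the cost of relying on the two frames being aligned row for row.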
- Create the instance of iterative imputer and then fit and transform lpfdf:
itimp = IterativeImputer()
iilpf = itimp.fit_transform(lpfdf)
Then make a dataframe for these new imputed values:
itimplpf = pd.DataFrame(np.round(iilpf), columns = lepf.columns)
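A side note on the rounding step: a regression-based imputer can return values slightly outside the range of valid labels, so rounding alone may give codes that no encoder has seen. A possible guard, sketched here with plain NumPy on made-up numbers, is to clip each column to its valid label range after rounding. The array `imputed` and the per-column class counts in `n_classes` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical imputer output: two encoded columns with float values.
imputed = np.array([
    [0.4, 5.7],
    [1.2, -0.3],
    [0.9, 2.5],
])

# Hypothetical number of valid labels per column (labels run 0..n-1).
n_classes = np.array([2, 6])

# Round to the nearest integer, then clip each column to [0, n-1].
codes = np.clip(np.round(imputed), 0, n_classes - 1).astype(int)
print(codes)
```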
And finally, when I try to inverse transform to see the corresponding labels it imputed, I get the following error:
for col in lpfdf:
    le.inverse_transform(itimplpf[col].astype(int))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-191-fbdde4bb4781> in <module>
1 for col in lpfdf:
----> 2 le.inverse_transform(itimplpf[col].astype(int))
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in inverse_transform(self, y)
158 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
159 if len(diff):
--> 160 raise ValueError(
161 "y contains previously unseen labels: %s" % str(diff))
162 y = np.asarray(y)
ValueError: y contains previously unseen labels: [2 3 4 5]
What is wrong with my steps? Sorry for my long-winded explanation but I felt that I need to explain all the steps so that you can understand the issue properly. Thank you all.