I want to impute missing values in a dataframe that holds both categorical and numerical data. Of the two categorical columns, one is ordinal (the column named "category_with_order") and the other is not (the column named "category_without_order"). The "sugar" and "salt" columns hold numerical data.

Finally, I want to decode the result so that I get back my original column headers, with the new dataframe imputed by the KNN imputer.

Here is my initial dataframe: [image not available]

What I did was one-hot encode the categories, with no distinction between the two categorical columns, and I now have two "nan" columns, which is weird: [image not available]

Then I concatenated my initial dataframe with the encoded categorical variables: [image not available]

Here is the code I have been using to output these dataframes:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer

data = {
    'category_with_order': ['a', 'b', 'c','d',np.nan],
    'category_without_order': ['plant',np.nan,'salad','meat', 'drinks'],
    'sugar': ['1',np.nan, '2', '2',np.nan],
    'salt': ['1',np.nan, '2', '1',np.nan]
}

df = pd.DataFrame(data)

ohe = OneHotEncoder()
feature_array = ohe.fit_transform(df[["category_with_order","category_without_order"]]).toarray()
features_labels = ohe.categories_
feature_labels = np.hstack([i.ravel() for i in features_labels])
features = pd.DataFrame(feature_array, columns = feature_labels)
knn = KNNImputer(n_neighbors=1, add_indicator = True)
df_new = pd.concat([df.reset_index(drop=True), features.reset_index(drop=True)], axis=1)
yoopiyo
  • you probably have string "nans". Can you check it ? or use something like this: `replace("NaN",np.nan)` – Bushmaster Nov 18 '22 at 09:11
  • In the definition of the dataframe "data" it is np.nan already. So I have "np.nan"s in the columns that are numericals and in categorical columns. So I assume they are not considered as nan strings. It is when I "OneHotEncode" the categories that I don't know what to do with the new "Nan" columns – yoopiyo Nov 18 '22 at 09:31
  • so I tried replace("NaN",np.nan) and it gives the same as my NaNs are already set to np.nan – yoopiyo Nov 18 '22 at 10:25
  • Start by prolly smthn like `df = df.dropna()` – Gautam Chettiar Nov 18 '22 at 10:53
  • i want to keep the NaNs, I want to impute my categorical data as well as my numerical data. But I should have all rows from the respective catagorical data input set to NaNs instead of new col NaNs I guess – yoopiyo Nov 18 '22 at 11:01
  • First, impute the data and then go for OHE. **PS:** why is the ordinal categorical data transformed with OHE rather than ordinal encoding, which would keep the order/sequence/priority in check? – GodWin1100 Nov 18 '22 at 11:24
  • I can't impute non numerical data, that's why I need OHE first and then impute categorical data. Yes I guess for ordinal categories I will have to try ordinal encoding – yoopiyo Nov 18 '22 at 11:28
  • Refer to this [SO](https://stackoverflow.com/questions/64900801/implementing-knn-imputation-on-categorical-variables-in-an-sklearn-pipeline) which will clear your doubts and limitation of KNNImputer and how to tackle it and different approaches. – GodWin1100 Nov 19 '22 at 11:41
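
The approach discussed in the comments (ordinal codes for the ordered column, one-hot for the unordered one, missing values kept as NaN rather than turned into extra columns, then decoding after imputation) could look roughly like the sketch below. It is not a definitive solution: it assumes the order is a < b < c < d, that "sugar" and "salt" are stored as numbers rather than strings, and it uses pandas for the encoding instead of `OneHotEncoder`. The names `order`, `codes`, `dummies`, `encoded`, `imputed` and `decoded` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "category_with_order": ["a", "b", "c", "d", np.nan],
    "category_without_order": ["plant", np.nan, "salad", "meat", "drinks"],
    "sugar": [1, np.nan, 2, 2, np.nan],  # assumption: numeric, not strings
    "salt": [1, np.nan, 2, 1, np.nan],
})

# Ordered column -> integer codes (assumed order a < b < c < d); -1 marks missing.
order = ["a", "b", "c", "d"]
cat = pd.Categorical(df["category_with_order"], categories=order, ordered=True)
codes = pd.Series(cat.codes, index=df.index, dtype=float).replace(-1, np.nan)

# Unordered column -> one-hot; rows that were missing become all-NaN
# instead of getting a separate "nan" column.
dummies = pd.get_dummies(df["category_without_order"], dtype=float)
dummies.loc[df["category_without_order"].isna(), :] = np.nan

encoded = pd.concat(
    [codes.rename("category_with_order"), dummies, df[["sugar", "salt"]]],
    axis=1,
)

# Impute everything at once on the purely numeric frame.
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=1).fit_transform(encoded),
    columns=encoded.columns,
    index=df.index,
)

# Decode back to the original headers.
decoded = df.copy()
code_int = imputed["category_with_order"].round().clip(0, len(order) - 1).astype(int)
decoded["category_with_order"] = [order[i] for i in code_int]
decoded["category_without_order"] = imputed[dummies.columns].idxmax(axis=1)
decoded[["sugar", "salt"]] = imputed[["sugar", "salt"]]
print(decoded)
```

With `n_neighbors=1` every imputed row copies a single donor, so the one-hot block stays a valid one-hot row and `idxmax` can decode it. With more neighbours the one-hot values become averages, and `idxmax` then picks the most frequent category among the neighbours. Note that the distances mix ordinal codes, 0/1 indicators and raw numeric values, so scaling the columns beforehand may be worth considering.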

0 Answers