3

I am working on multi-class classification problem having five classes in the target column. I have generated features for categorical variables using expanding mean encoding(Target encoding). The method is based on encoding categorical variable values with mean of target variable per value.

This also results in some NaN values like in 'Transaction-Type_mean_target' column.

  1. What is the best way to fill these NaN values? Should I fill these with the column mean.

  2. How do I generate mean encoding for my test data as the target/Dependent variable 'Complaint-Status' is not present?

Input data :

enter image description here

Generating mean encoding :

def add_feat_mean_encoding(col_list):
    """
        Expanding mean encoding 
    """
    for i in col_list:
        cumsum = train.groupby(i)['Complaint-Status'].cumsum() - train['Complaint-Status']
        cumcnt = train.groupby(i).cumcount()
        train[i+'_mean_target'] = cumsum/cumcnt

cat_var = ['Transaction-Type','Complaint-reason','Company-response','Consumer-disputes']
add_feat_mean_encoding(cat_var)
joel
  • 1,156
  • 3
  • 15
  • 42
  • Hallo, just to understand: Which among your features are categorical and which are scalar? CmpantStatus is a scalar or categorical (boolean perhaps, 0/1) ? – Catalina Chircu Feb 17 '20 at 06:04

1 Answers1

1

If your features are categorical, imputing with mean does not make sense, at least it means that you create a new value for Nans. is that what you want ?

In order to answer your questions :

  1. For categorical features, you may try different ways. You may start with SimpleImputer of scikit (see here), try for example :
FILL_VALUE=100
imp_1 = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_2 = SimpleImputer(missing_values=np.nan, strategy='constant', fille_value=FILL_VALUE)
  1. Please provide information on your testset : which are the features ? If you do not have the feature Complaint-Status in the testset, there are two ways :

    • You can predict the feature Complaint-Status, for the test set, using the training set (and the feature you wish to predict, Complaint-Status, as Y). Try different classifiers and choose the one which provides the best result.
    • You can also use a 2 dimension Y, with your actual Y + Complaint-Status.
Catalina Chircu
  • 1,506
  • 2
  • 8
  • 19