How can I generate categorical synthetic samples with imblearn and SMOTE?

Question

I am looking to generate synthetic samples for a machine learning algorithm using imblearn's SMOTE. I have a few categorical features which I have converted to integers using sklearn's preprocessing.LabelEncoder.

The problem is that when I use SMOTE to generate synthetic data, the data points become floats rather than the integers I need for the categorical data.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
import numpy as np
from sklearn import preprocessing

if __name__ == '__main__':
    df = pd.read_csv('resample.csv')
    y = df['Result']
    accounts = df['Account Number']
    df.drop('Result',axis=1,inplace=True)
    df.drop('Account Number', axis=1, inplace=True)

    df.fillna(value=0, inplace=True)

    le = preprocessing.LabelEncoder()
    le.fit(df['Distribution Partner'])
    print(le.classes_)
    df['Distribution Partner'] = le.transform(df['Distribution Partner'])
    print('Original dataset shape {}'.format(Counter(y)))
    sm = SMOTE()  # regular SMOTE; the old kind='regular' argument is no longer accepted
    X_resampled, y_resampled = sm.fit_resample(df, y)  # fit_sample in older imblearn releases
    np.savetxt('output.csv', X_resampled, delimiter=",")
    print('New dataset shape {}'.format(Counter(y_resampled)))

Is there any way I can get SMOTE to generate synthetic samples using only values such as 0, 1, 2, etc. instead of 0.5, 1.23, 2.004?

score 3 · Answer 1 · answered May 14 '19 at 12:29

It is quite simple: use SMOTENC instead of SMOTE. SMOTENC can handle both categorical and non-categorical features.

Sample Code:

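from imblearn.over_sampling import SMOTENC

# Columns 1 and 4 are the categorical features
obj = SMOTENC(categorical_features=[1, 4])
# fit_resample replaces fit_sample in recent imblearn releases
oversampled_features, oversampled_target = obj.fit_resample(Features, Target)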

[1, 4] are the indices of the categorical columns in the data set.

*Indices start from 0.

vipin bansal

Hi, I have a similar problem. If you have time, could I request your help with this related post? https://stackoverflow.com/questions/71193740/typeerror-encoders-require-their-input-to-be-uniformly-strings-or-numbers-got – The Great Feb 20 '22 at 11:13

score 0 · Answer 2 · answered Mar 03 '17 at 18:58

Unfortunately, imblearn's SMOTE implementation only supports continuous data. It is discussed here.
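
To see why plain SMOTE yields floats, here is a rough sketch of the interpolation step the SMOTE technique is built on. This is not imblearn's code; the two rows and the names x_i, x_nn, and lam are hypothetical. A synthetic point is placed at a random position on the line segment between a minority sample and one of its minority-class neighbours, so a label-encoded column ends up between two codes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class rows; the last column is a
# label-encoded category (codes 0 and 2).
x_i = np.array([0.2, 1.5, 0.0])
x_nn = np.array([0.6, 2.5, 2.0])

# SMOTE-style interpolation: pick a random position between the two rows.
lam = rng.uniform(0.0, 1.0)
x_new = x_i + lam * (x_nn - x_i)

print(x_new)  # the last entry lies between 0 and 2, generally not a whole number
```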

artex
