
I would greatly appreciate it if you could let me know how to use SMOTENC. Here is what I wrote:

# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import make_pipeline

# Data
XX = pd.read_csv('Financial Distress.csv')
y = np.array(XX['Financial Distress'].values.tolist())
y = np.array([0 if i > -0.50 else 1 for i in y])
Na = np.array(pd.read_csv('Na.csv', header=None).values)

XX = XX.iloc[:, 3:127]

# Use get_dummies to convert the categorical feature into dummy columns
dis_features = ['x121']
X = pd.get_dummies(XX, columns=dis_features)

# Divide data into train and test
indices = np.arange(y.shape[0])
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(X, y, indices, stratify=y, test_size=0.3,
                                                                         random_state=42)
num_indices = list(X)[:X.shape[1] - 37]
cat_indices = list(X)[X.shape[1] - 37:]
num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 123:160]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))

pipeline = Pipeline(steps=[
    # Categorical features
    ('feature_processing', FeatureUnion(transformer_list=[
        ('categorical', MultiColumn(cat_indices)),

        # Numeric features
        ('numeric', Pipeline(steps=[
            ('select', MultiColumn(num_indices)),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])
pipeline_with_resampling = make_pipeline(SMOTENC(categorical_features=cat_indices1), pipeline)

# Grid search to determine the best params
cv = StratifiedKFold(n_splits=5, random_state=42)
rg_cv = GridSearchCV(pipeline_with_resampling, param_grid, cv=cv, scoring='f1')
rg_cv.fit(X_train, y_train)

As indicated above, I have 5 categorical features. In fact, indices 123 to 160 all belong to a single categorical feature with 37 possible values, which get_dummies converted into 37 columns. Unfortunately, the code throws the following error:

Traceback (most recent call last):
  File "D:/mifs-master_2/MU/learning-from-imbalanced-classes-master/learning-from-imbalanced-classes-master/continuous/Final Logit/SMOTENC/logit-final - Copy.py", line 424, in <module>
    rg_cv.fit(X_train, y_train)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 722, in fit
    self._run_search(evaluate_candidates)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 1191, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 711, in evaluate_candidates
    cv.split(X, y, groups)))
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 528, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 237, in fit
    Xt, yt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 200, in _fit
    cloned_transformer, Xt, yt, **fit_params_steps[name])
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 342, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 576, in _fit_resample_one
    X_res, y_res = sampler.fit_resample(X, y, **fit_params)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\base.py", line 85, in fit_resample
    output = self._fit_resample(X, y)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\over_sampling\_smote.py", line 940, in _fit_resample
    self._validate_estimator()
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\over_sampling\_smote.py", line 933, in _validate_estimator
    ' should be between 0 and {}'.format(self.n_features_))
ValueError: Some of the categorical indices are out of range. Indices should be between 0 and 160

Thanks in advance.

  • It seems that SMOTENC internally applies `get_dummies`. However, I also need to apply `get_dummies` before splitting my dataset into train and test; otherwise this error is reported: `ValueError: Input contains NaN, infinity or a value too large for dtype('float64')`, raised from `rg_cv.fit(X_train, y_train)`. – ebrahimi Feb 01 '19 at 14:18
  • This question lacks a [mcve]. Making the effort to construct your question well will maximize the likelihood of getting useful answers. As it stands right now, I'd have to do way too much work on my own to understand what is going on. Maybe somebody else will know exactly what is going on but I do not. Also, please refrain from pinging several other high-rep users on other questions to come answer your question when you haven't put in the requisite time to ask it appropriately. – piRSquared Feb 06 '19 at 14:19

4 Answers


As follows, two pipelines should be used, and SMOTENC must be given the positional (integer) indices of the categorical columns rather than their names or dummy-encoded columns:

num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:120, 121:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 120]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))
cat_indices = [94, 96, 98, 99, 120]  # positional indices for SMOTENC

from imblearn.pipeline import make_pipeline

pipeline = Pipeline(steps=[
    # Categorical features
    ('feature_processing', FeatureUnion(transformer_list=[
        ('categorical', MultiColumn(cat_indices1)),

        # Numeric features
        ('numeric', Pipeline(steps=[
            ('select', MultiColumn(num_indices1)),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])
pipeline_with_resampling = make_pipeline(SMOTENC(categorical_features=cat_indices), pipeline)

You cannot dummy-encode your categorical variables and then feed them to SMOTENC, because SMOTENC already handles the categorical encoding internally in its algorithm; doing both will bias your model. Alternatively, I recommend using SMOTE() instead of SMOTENC(), but in that case you must apply get_dummies first.


You cannot use a scikit-learn pipeline with an imblearn sampler. The imblearn pipeline implements fit_resample as well as fit/predict; the sklearn pipeline only implements fit/predict. You cannot combine them.


First, don't apply get_dummies. Then change how you build categorical_features: pass a list of booleans indicating, for each column, whether it is categorical.

Try this:

cat_cols = []
for col in x.columns:
    if x[col].dtype == 'object':  # or 'category', if that's the case
        cat_cols.append(True)
    else:
        cat_cols.append(False)

Then pass cat_cols to your SMOTENC:

smote_nc = SMOTENC(categorical_features=cat_cols, random_state=0)