
Below is my code. I need to use SMOTENC to balance the dataset, which means I have to use the Pipeline from the imblearn library. However, it does not recognize the features produced by CountVectorizer.

from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline
# from sklearn.pipeline import Pipeline

vectorizer_params = dict(ngram_range=(1, 2), min_df=200, max_df=0.8)

categorical_features = ['F1', 'F2', 'F3', 'F4']
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

textual_feature = ['F5']
text_transformer = Pipeline(
    steps=[
        ("squeez", FunctionTransformer(lambda x: x.squeeze())),
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("toarray", FunctionTransformer(lambda x: x.toarray())),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("txt", text_transformer, textual_feature),
    ]
)

sgd_log_pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("smote", SMOTENC(random_state=11, categorical_features=[4, 5, 6, 7])),
        ("clf", SGDClassifier()),
    ]
)

1 Answer


Since you are using SMOTENC, there is no need to do one hot encoding yourself. If you check out the source code of SMOTENC, you will see that it performs a one hot encoding internally on the categorical features you provide.

One solution is to ordinal encode your categorical features and have SMOTENC treat those columns as categorical.
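As a quick toy sketch (illustrative data only, not your dataset), SMOTENC accepts integer-coded categorical columns directly once you tell it which columns are categorical:

import numpy as np
from imblearn.over_sampling import SMOTENC

# Toy data: column 0 is an integer-coded categorical feature,
# column 1 is continuous; class 1 is the minority class
rng = np.random.RandomState(0)
X_toy = np.column_stack([rng.randint(0, 3, 100), rng.randn(100)])
y_toy = np.array([0] * 90 + [1] * 10)

sm = SMOTENC(categorical_features=[0], random_state=0)
X_res, y_res = sm.fit_resample(X_toy, y_toy)
print(X_res.shape, np.bincount(y_res))  # (180, 2) [90 90] -- classes balanced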

Using an example dataset:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, OrdinalEncoder
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTENC
from sklearn.linear_model import SGDClassifier

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train',categories=['rec.autos','sci.space','comp.graphics'])

# Random categorical columns F1-F4, plus the newsgroup text as F5
n = len(newsgroups_train.data)
X = pd.DataFrame(np.random.choice(['A','B','C'],(n,4)), columns=['F1','F2','F3','F4'])
X['F5'] = newsgroups_train.data

# Binary target: 1 for sci.space (target index 2), 0 otherwise
y = (newsgroups_train.target == 2).astype(int)

Your vectorizer part:

vectorizer_params = dict(ngram_range=(1, 2), min_df=200, max_df=0.8)

textual_feature = ['F5']
text_transformer = Pipeline(
    steps=[
        # squeeze the single-column frame into a Series for CountVectorizer
        ("squeez", FunctionTransformer(lambda x: x.squeeze())),
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        # convert the sparse tf-idf output to a dense array
        ("toarray", FunctionTransformer(lambda x: x.toarray())),
    ]
)

Use an ordinal encoder instead of one-hot:

categorical_features = ['F1','F2','F3','F4']
categorical_transformer = OrdinalEncoder()

The rest of the pipeline stays the same, except that for SMOTENC's categorical_features= parameter we pass the first 4 column indices, since ColumnTransformer places your 4 ordinal-encoded categorical features first:

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("txt", text_transformer, textual_feature),
    ]
)

sgd_log_pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("smote", SMOTENC(random_state=11, categorical_features=[0, 1, 2, 3])),
        ("clf", SGDClassifier()),
    ]
)

Now we can test the preprocessing part. From your text vectorizer we expect an output of:

text_transformer.fit_transform(X[textual_feature]).shape
(1771, 178)

Together with our 4 ordinal-encoded features, the output from the preprocessor is what we expect:

preprocessor.fit_transform(X).shape
(1771, 182) 

# rounded for display purposes
preprocessor.fit_transform(X).round(3)
array([[0.   , 0.   , 0.   , ..., 0.308, 0.   , 0.361],
       [1.   , 2.   , 2.   , ..., 0.252, 0.12 , 0.099],
       [2.   , 0.   , 1.   , ..., 0.05 , 0.   , 0.   ],
       ...,
       [1.   , 1.   , 2.   , ..., 0.119, 0.226, 0.   ],
       [1.   , 1.   , 0.   , ..., 0.   , 0.   , 0.   ],
       [1.   , 0.   , 1.   , ..., 0.   , 0.   , 0.   ]])

In this instance, the first 4 columns are your 4 categorical features, ordinal encoded, and the final matrix has 4 + (number of features from the text vectorizer) = 4 + 178 = 182 columns.
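If you want to verify that count, one way (a sketch assuming a scikit-learn version that has get_feature_names_out) is to pull the fitted vectorizer out of the preprocessor and count its vocabulary:

# 4 ordinal-encoded columns plus one column per vocabulary term
vect = preprocessor.named_transformers_['txt'].named_steps['vect']
print(len(categorical_features) + len(vect.get_feature_names_out()))  # 182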

Let's fit the pipeline and check what goes into your classifier:

sgd_log_pipeline.fit(X, y)
sgd_log_pipeline.named_steps['clf'].n_features_in_

182
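If you also want to see the resampling itself, you can run the preprocessing and SMOTE steps by hand and compare class counts before and after (a sketch reusing the objects defined above):

from collections import Counter

# Preprocess, then resample manually to inspect the class balance
Xt = preprocessor.fit_transform(X)
smote = SMOTENC(random_state=11, categorical_features=[0, 1, 2, 3])
X_res, y_res = smote.fit_resample(Xt, y)
print(Counter(y), Counter(y_res))  # the minority class is oversampled to match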
– StupidWolf