4

I would like to know how to combine different features when I use a classifier. For example:

random_forest_ngram = Pipeline([
        ('rf_tfidf', Feat_Selection.countV),
        ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3))
        ])

random_forest_ngram.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_rf_ngram = random_forest_ngram.predict(DataPrep.test_news['Text'])
np.mean(predicted_rf_ngram == DataPrep.test_news['Label'])

I am also considering other features in the model. I defined X and y as follows:

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

df_train= pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

countV = CountVectorizer()
train_count = countV.fit_transform(df_train['Text'].values)

My dataset looks as follows:

Text                                is_it_capital?   is_it_upper?   contains_num?   Label
an example of text                         0                0              0          0
ANOTHER example of text                    1                1              0          1
What's happening? Let's talk at 5          1                0              1          1

I would also like to use is_it_capital?, is_it_upper?, and contains_num? as features. Since they already have binary values (1 or 0, after encoding), I should apply BoW only to Text to extract the extra features. Maybe my question is obvious, but since I am a new ML learner and not familiar with classifiers and encoding, I would be thankful for all the support and comments you can provide. Thanks


LdM

2 Answers

5

You can certainly use your "extra" features like is_it_capital?, is_it_upper?, and contains_num?. It seems you're struggling with how exactly to combine the two seemingly disparate feature sets. You could use something like sklearn.pipeline.FeatureUnion or sklearn.compose.ColumnTransformer to apply a different encoding strategy to each set of features. There's no reason you couldn't use your extra features in combination with whatever a text-feature-extraction method (e.g. your BoW approach) produces.

import pandas as pd

df = pd.DataFrame({
    'text': ['this is some text', 'this is some MORE text', 'hi hi some text 123', 'bananas oranges'],
    'is_it_upper': [0, 1, 0, 0],
    'contains_num': [0, 0, 1, 0],
})

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer([('text', CountVectorizer(), 'text')], remainder='passthrough')
X = transformer.fit_transform(df)

print(X)
[[0 0 0 1 0 0 1 1 1 0 0]
 [0 0 0 1 1 0 1 1 1 1 0]
 [1 0 2 0 0 0 1 1 0 0 1]
 [0 1 0 0 0 1 0 0 0 0 0]]
print(transformer.get_feature_names())
['text__123', 'text__bananas', 'text__hi', 'text__is', 'text__more', 'text__oranges', 'text__some', 'text__text', 'text__this', 'is_it_upper', 'contains_num']
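
For completeness, here is a minimal sketch of the sklearn.pipeline.FeatureUnion route mentioned above. FeatureUnion has no built-in column selection, so the ColumnSelector helper below is a hypothetical illustration, not part of scikit-learn; under that assumption it should produce the same feature matrix as the ColumnTransformer example:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer

class ColumnSelector(BaseEstimator, TransformerMixin):
    # hypothetical helper: selects columns from a DataFrame
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        selected = X[self.columns]
        # return an array for the binary columns so FeatureUnion can stack it
        return selected if isinstance(self.columns, str) else selected.to_numpy()

union = FeatureUnion([
    # BoW features from the raw text column (CountVectorizer expects 1-D input)
    ('bow', Pipeline([
        ('select', ColumnSelector('text')),
        ('vectorizer', CountVectorizer()),
    ])),
    # binary columns passed through unchanged
    ('binary', ColumnSelector(['is_it_upper', 'contains_num'])),
])

X_union = union.fit_transform(df)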

More on your specific example:

X=df[['Text','is_it_capital?', 'is_it_upper?', 'contains_num?']]
y=df['Label']

# Need a DenseTransformer step so that the CountVectorizer output can be
# concatenated with the passthrough columns downstream
from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X, y=None, **fit_params):
        # toarray() returns a plain ndarray, which is easier to work with
        # downstream than the np.matrix that todense() returns
        return X.toarray()

from sklearn.pipeline import Pipeline
pipeline = Pipeline([
     ('vectorizer', CountVectorizer()), 
     ('to_dense', DenseTransformer()), 
])

transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')

X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size=0.25, random_state=40)

X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

# the transformer returns NumPy arrays, so wrap them in DataFrames before
# concatenating with the label Series
df_train = pd.concat([pd.DataFrame(X_train), y_train.reset_index(drop=True)], axis=1)
df_test = pd.concat([pd.DataFrame(X_test), y_test.reset_index(drop=True)], axis=1)
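
From here, a minimal sketch of training the classifier from the question on the transformed matrices (the hyperparameters are copied from the question; this step is an illustration, not prescribed by the original answer):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_train/X_test come from transformer.fit_transform/transform above
rf = RandomForestClassifier(n_estimators=300, n_jobs=3)
rf.fit(X_train, y_train)
predicted = rf.predict(X_test)
print(np.mean(predicted == y_test))  # accuracy on the held-out set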
blacksite
  • Thanks for your answer, blacksite. Yes, I am struggling with this problem, as it is not fully clear to me how to include those features. From your example, it seems that I should apply the ColumnTransformer before splitting into train/test. Is that right? – LdM Feb 18 '21 at 01:31
  • Yes. Apply the ColumnTransformer to your training dataset (`df_train`) using the `fit`/`fit_transform` methods. Then, apply the transformer to your test dataset (`df_test`) using *only the `transform`* method, since we've already learned what our training dataset needs to look like from the `fit`/`fit_transform` methods. More on fit/fit_transform [here](https://datascience.stackexchange.com/questions/12321/whats-the-difference-between-fit-and-fit-transform-in-scikit-learn-models). – blacksite Feb 18 '21 at 01:53
  • Very, very similar to what you've done in your example with the CountVectorizer. The result of the ColumnTransformer process will give you your input matrices for training and test (`X_train`/`X_test`). I updated my answer above. – blacksite Feb 18 '21 at 01:54
  • thanks for your answer and update on that. I tried to follow what you suggested, but I got the error: TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid. Any idea on how I can fix it? – LdM Feb 18 '21 at 14:45
  • You could do something like what's detailed in this [answer](https://stackoverflow.com/a/28384887/5015569). Edits above. – blacksite Feb 18 '21 at 15:32
2

What I find useful is to set up my transformations in a way that gives me total control: for each set of columns I perform a specific transformation, and at the end I union the transformations. Here is an example:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.ensemble import RandomForestClassifier

# boolean
boolean_features = ['is_it_capital?', 'is_it_upper?', 'contains_num?']
boolean_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
    ]
)

text_features = 'Text'
text_transformer = Pipeline(
    steps=[('vectorizer', CountVectorizer())]
)

# merge all pipelines

preprocessor = ColumnTransformer(
    transformers=[
        ('bool', boolean_transformer, boolean_features),
        ('text', text_transformer, text_features),
    ]
)

pipelines = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', RandomForestClassifier(n_estimators=300,n_jobs=3))
    ]
)

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42, stratify=y)


# now we can train our model
pipelines.fit(X_train, y_train)
pipelines.score(X_test, y_test)

# what is awesome is that using other tools like GridSearchCV becomes easy

params = {'model__n_estimators': [100, 200, 300], 'model__criterion': ['gini', 'entropy']}

clf = GridSearchCV(
    pipelines, cv=5, n_jobs=-1, param_grid=params, scoring='roc_auc'
)

clf.fit(X_train, y_train)

# predict for totally unseen data
clf.predict(X_test)
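
As a small follow-up (not shown in the original answer), the fitted GridSearchCV object exposes the winning configuration through its standard attributes:

# inspect the best hyper-parameters and the best cross-validated score
print(clf.best_params_)
print(clf.best_score_)  # mean cross-validated roc_auc of the best candidate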

Updates

If we have columns that need no transformation but should still be included, add `remainder='passthrough'`:

# assumption: the boolean transformer above has been removed
# ...
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
    ],
    remainder='passthrough'
)
#...

See the scikit-learn documentation for [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) and [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for more usage examples.

Prayson W. Daniel
  • Thanks for your answer, Prayson. One question: may I apply the transformer to numerical variables even if I have no NA values and the variables are already encoded as 1 and 0? – LdM Feb 21 '21 at 12:45
  • Yes, it does not matter if there are no NAs. But if you don't want any transformation, then add `remainder='passthrough'`. I will update my answer. – Prayson W. Daniel Feb 21 '21 at 15:47