When working with text data, I understand the need to encode text labels into some numeric representation (e.g., by using LabelEncoder, OneHotEncoder, etc.).
However, my question is whether you need to perform this step explicitly when you're using a feature extraction class (e.g., TfidfVectorizer, CountVectorizer, etc.), or whether these will encode the labels under the hood for you.
If you do need to encode the labels separately yourself, can you perform this step inside a Pipeline, such as the one below?
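    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline(steps=[
        ('tfidf', TfidfVectorizer()),
        ('sgd', SGDClassifier())
    ])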
Or do you need to encode the labels beforehand, since the pipeline expects to fit() and transform() the data (not the labels)?
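To make the setup concrete, here is a minimal sketch of the scenario being asked about. The documents, the label names, and the `random_state` value are invented for illustration and are not part of the original pipeline. It fits the pipeline with raw string labels and then inspects the classifier's `classes_` attribute; in scikit-learn, classifiers generally accept string targets directly, while the vectorizer step only transforms the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy documents and string labels, invented purely for illustration.
docs = [
    "the match ended in a draw",
    "the striker scored a late goal",
    "the team won the league title",
    "the election results were announced",
    "parliament passed the new budget",
    "the minister gave a speech",
]
labels = ["sport", "sport", "sport", "politics", "politics", "politics"]

pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer()),
    # random_state is an assumption added here for repeatable runs.
    ("sgd", SGDClassifier(random_state=0)),
])

# The pipeline passes y along to each step's fit; the vectorizer ignores it
# and the classifier receives the string labels as they are.
pipeline.fit(docs, labels)

# The fitted classifier records the distinct labels it saw.
classes = list(pipeline.named_steps["sgd"].classes_)

# Predictions come back as the original string labels, not integers.
predictions = pipeline.predict(["the goalkeeper saved a penalty"])
```

If integer codes are wanted anyway (for example, for a downstream tool that requires them), one option is to apply LabelEncoder to the labels before calling fit(), since the transformer steps of a Pipeline operate on the data rather than on the targets.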