I am trying to use LogisticRegression for text classification. I am using FeatureUnion to combine the features of the DataFrame and then cross_val_score to test the accuracy of the classifier. However, I don't know how to include the free-text feature, called tweets, within the pipeline. I am using TfidfVectorizer for the bag-of-words model.
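import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

# DataFrameSelector is a custom transformer (not shown) that selects DataFrame columns.
nominal_features = ["tweeter", "job", "country"]
numeric_features = ["age"]
numeric_pipeline = Pipeline([
    ("selector", DataFrameSelector(numeric_features)),
])
nominal_pipeline = Pipeline([
    ("selector", DataFrameSelector(nominal_features)),
    ("onehot", OneHotEncoder()),
])
text_pipeline = Pipeline([
    ("selector", DataFrameSelector("tweets")),
    ("vectorizer", TfidfVectorizer(stop_words="english")),
])
# Note: text_pipeline is defined above but is not yet part of the union.
pipeline = Pipeline([
    ("union", FeatureUnion([
        ("numeric_pipeline", numeric_pipeline),
        ("nominal_pipeline", nominal_pipeline),
    ])),
    ("estimator", LogisticRegression()),
])
np.mean(cross_val_score(pipeline, df, y, scoring="accuracy", cv=5))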
Is this the right way to include the tweets free-text data in the pipeline?
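
To make the question concrete, here is a minimal, self-contained sketch of one way the text branch might be added to the FeatureUnion. Several things in it are assumptions rather than part of my actual code: the DataFrameSelector below is a hypothetical stand-in for the custom selector, the toy DataFrame and labels are made up purely so the snippet runs, and handle_unknown="ignore" is added so that a category missing from a training fold does not raise an error. The sketch also assumes the selector returns a 1-D sequence of strings when given a single column name, since TfidfVectorizer expects raw documents rather than a 2-D array.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Hypothetical selector: returns the requested column(s) as a NumPy array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # A single column name gives a 1-D array (suitable for TfidfVectorizer);
        # a list of names gives a 2-D array (suitable for OneHotEncoder).
        return X[self.columns].to_numpy()


# Toy data, invented only so the sketch can run end to end.
df = pd.DataFrame({
    "tweeter": ["a", "b", "a", "c", "b", "a", "c", "b", "a", "c"],
    "job": ["dev", "chef", "dev", "chef", "dev", "chef", "dev", "chef", "dev", "chef"],
    "country": ["US", "UK", "US", "FR", "UK", "US", "FR", "UK", "US", "FR"],
    "age": [25, 32, 41, 29, 37, 22, 50, 33, 28, 45],
    "tweets": [
        "shipping new python release today",
        "fresh pasta recipe with basil",
        "debugging python code all night",
        "baking bread and fresh pasta",
        "python release notes look great",
        "new recipe for tomato soup",
        "code review and python tests",
        "basil tomato soup for dinner",
        "writing tests for new code",
        "bread recipe with fresh basil",
    ],
})
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

numeric_pipeline = Pipeline([
    ("selector", DataFrameSelector(["age"])),
])
nominal_pipeline = Pipeline([
    ("selector", DataFrameSelector(["tweeter", "job", "country"])),
    # Assumption: ignore categories that were not seen in the training fold.
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
text_pipeline = Pipeline([
    ("selector", DataFrameSelector("tweets")),  # string, not list: 1-D output
    ("vectorizer", TfidfVectorizer(stop_words="english")),
])

pipeline = Pipeline([
    ("union", FeatureUnion([
        ("numeric_pipeline", numeric_pipeline),
        ("nominal_pipeline", nominal_pipeline),
        ("text_pipeline", text_pipeline),  # the text branch joins the union here
    ])),
    ("estimator", LogisticRegression()),
])

scores = cross_val_score(pipeline, df, y, scoring="accuracy", cv=5)
print(np.mean(scores))
```

Because the vectorizer sits inside the pipeline, it would be refit on the training portion of each fold, so the TF-IDF vocabulary should not leak information from the held-out rows.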