
I am trying to use LogisticRegression for text classification. I combine the DataFrame's features with a FeatureUnion and then use cross_val_score to estimate the classifier's accuracy. However, I don't know how to include the free-text feature, a column called tweets, in the pipeline. I am using TfidfVectorizer for the bag-of-words model.

nominal_features = ["tweeter", "job", "country"]
numeric_features = ["age"]

numeric_pipeline = Pipeline([
    ("selector", DataFrameSelector(numeric_features))
])

nominal_pipeline = Pipeline([
    ("selector", DataFrameSelector(nominal_features)),
    ("onehot", OneHotEncoder())])

text_pipeline = Pipeline([
    ("selector", DataFrameSelector("tweets")),    
    ("vectorizer", TfidfVectorizer(stop_words='english'))])

pipeline = Pipeline([
    ("union", FeatureUnion([
        ("numeric_pipeline", numeric_pipeline),
        ("nominal_pipeline", nominal_pipeline)])),
    ("estimator", LogisticRegression())])

np.mean(cross_val_score(pipeline, df, y, scoring="accuracy", cv=5))

Is this the right way to include the tweets free text data in the pipeline?

Paul K
  • You have not included your `text_pipeline` into the main `pipeline`. So how will it work? – Vivek Kumar Nov 28 '18 at 10:38
  • see https://medium.com/@baemaek/text-mining-preprocess-and-naive-bayes-classifier-da0000f633b2 (Text Mining using preprocessing and Naïve Bayes Classifier) – Golden Lion Jan 19 '21 at 23:36

1 Answer

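from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Bag of words -> TF-IDF weighting -> Naive Bayes classifier
pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='english', lowercase=True)),
    ('tfidf1', TfidfTransformer(use_idf=True, smooth_idf=True)),
    ('clf', MultinomialNB(alpha=1)),  # alpha=1 is Laplace smoothing
])

# df is assumed to have a 'Text' column (documents) and a 'Target' column (labels)
train, test = train_test_split(df, test_size=.3, random_state=42, shuffle=True)
pipeline.fit(train['Text'], train['Target'])

predictions = pipeline.predict(test['Text'])
print(test['Target'], predictions)

# pos_label is ignored unless average='binary', so it is omitted here
score = f1_score(test['Target'], predictions, average='micro')
print("Score of Naive Bayes is:", score)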
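
The pipeline above only uses the text column. To keep the question's other columns next to the tweets, one option is to route each column group through its own transformer and feed the combined features to LogisticRegression. The sketch below is only an illustration under some assumptions: it swaps the question's custom DataFrameSelector plus FeatureUnion for scikit-learn's ColumnTransformer, the toy DataFrame and labels are made up, and max_iter=1000 is an arbitrary choice to help the solver converge.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data that reuses the question's column names.
df = pd.DataFrame({
    "tweeter": ["ann", "bob", "ann", "cat", "bob", "cat", "ann", "bob", "cat", "ann"],
    "job": ["dev", "chef", "dev", "chef", "dev", "chef", "dev", "chef", "dev", "chef"],
    "country": ["us", "uk", "us", "fr", "uk", "fr", "us", "uk", "fr", "us"],
    "age": [25, 41, 33, 52, 29, 47, 38, 61, 22, 45],
    "tweets": [
        "great python release today", "terrible soggy pasta tonight",
        "love this new library", "awful burnt pizza again",
        "great docs and examples", "terrible service at dinner",
        "love the clean python api", "awful cold soup today",
        "great conference talk", "terrible stale bread",
    ],
})
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # made-up labels

nominal_features = ["tweeter", "job", "country"]
numeric_features = ["age"]

preprocess = ColumnTransformer([
    # Numeric columns are passed through unchanged, as in the question.
    ("numeric", "passthrough", numeric_features),
    # handle_unknown="ignore" avoids errors on categories unseen in a fold.
    ("nominal", OneHotEncoder(handle_unknown="ignore"), nominal_features),
    # A plain string (not a list) selects a 1-D column, which is the
    # input shape TfidfVectorizer expects.
    ("text", TfidfVectorizer(stop_words="english"), "tweets"),
])

pipeline = Pipeline([
    ("features", preprocess),
    ("estimator", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, df, y, scoring="accuracy", cv=5)
mean_accuracy = np.mean(scores)
print(mean_accuracy)
```

The same idea should carry over to FeatureUnion: the missing piece in the question is that the text branch has to be listed in the union, and its selector has to hand TfidfVectorizer a 1-D sequence of strings rather than a 2-D frame.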
Golden Lion
  • see (https://www.kdnuggets.com/2018/11/multi-class-text-classification-model-comparison-selection.html/2) Doc2Vec offers greater accuracy using logistic regression than MultinomialNB – Golden Lion Jan 20 '21 at 16:19