
For a side project of mine, I am trying to build a Naive Bayes model that can detect whether a piece of news is fake based on its headline. Here is my code so far:

import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv("/Users/amanpuranik/Desktop/fake-news-detection/data.csv")
data = data[['Headline', "Label"]]
print(data)

x = data[["Headline"]]
y = data[["Label"]]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=1)

tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

model = MultinomialNB()
model.fit(x_train, y_train)

When I run this, I get an error telling me that the headline cannot be converted to a float. Since a headline is made up of words, I'm not sure how a word could be converted to a float, so I was wondering what my next steps should be.

theguy
    Which float should a word be converted to, for example? – mkrieger1 Apr 05 '20 at 22:46
  • I haven't learned enough ML, but from what I know, you will have to convert the headline into a list of numbers, each with a certain meaning, for example, number of words in the headline, average word length, number of times a particular word is used, and map those numbers between 0 and 1. Correct me if I am wrong. – Sanjit Sarda Apr 05 '20 at 23:07
  • It appears that you need to read more on Natural Language Processing (NLP) to learn about various ways to code input for the desired processing, and then to pick one method. This is an issue far too broad for Stack Overflow. – Prune Apr 05 '20 at 23:12
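
To make the idea in the comments concrete, here is a minimal sketch (illustrative only, not part of the original thread) of how scikit-learn's CountVectorizer turns headlines into word-count vectors, the kind of numeric representation a classifier can work with:

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up headlines used purely for illustration
headlines = ["Scientists discover water on Mars",
             "Scientists deny water discovery"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(headlines)  # sparse matrix: one row per headline

print(sorted(vectorizer.vocabulary_))  # the words the vectorizer learned
print(counts.toarray())                # word counts for each headline

The answer below does the same thing with TfidfVectorizer, which additionally down-weights words that appear in many documents.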

1 Answer


If I understand correctly, you want to vectorize the text with the TfidfVectorizer first, then classify the resulting vectors with the MultinomialNB model. I recommend wrapping these two steps in a Pipeline, which makes it easier to deploy the model, cross-validate, or add more steps later.

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

data = pd.DataFrame({'Headline': ['Are Lizard Immigrants Stealing our Oil???',
                                  'Trade Summit Proceeds As Planned'],
                     'Label': ['Fake', 'Real']})

print(data)

# TfidfVectorizer expects a 1-D sequence of strings, so select the Headline
# column with single brackets to get a Series rather than a DataFrame
X = data['Headline']
y = data['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

tfidf_vectorizer = TfidfVectorizer(stop_words='english')

model = MultinomialNB()

# Chain vectorization and classification into a single estimator
pipeline = Pipeline([('vectorizer', tfidf_vectorizer), ('classifier', model)])

pipeline.fit(X_train, y_train)

print(pipeline)

Output:

                                    Headline Label
0  Are Lizard Immigrants Stealing our Oil???  Fake
1           Trade Summit Proceeds As Planned  Real
Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

Note that I removed the inner brackets when extracting y from the dataframe, since it should be 1-dimensional. The Headline column is extracted the same way: TfidfVectorizer expects a 1-D sequence of strings, so X is a Series rather than a single-column DataFrame.
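
As a follow-up (not part of the original answer), once the pipeline is fitted you can pass raw headline strings straight to it; a minimal sketch of scoring the held-out set, assuming the code above has already run:

from sklearn.metrics import accuracy_score

# The pipeline applies the fitted vectorizer before the classifier,
# so it accepts raw headlines directly
y_pred = pipeline.predict(X_test)

print(y_pred)
print(accuracy_score(y_test, y_pred))

With only two example headlines the accuracy is meaningless, but the same call works unchanged on the full CSV.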

l_l_l_l_l_l_l_l
  • That's strange, when I run your code I get a "ValueError: max_df corresponds to < documents than min_df" error – theguy Apr 06 '20 at 00:08
  • @puranikman are you sure you're using my code? (check the initialization of the tfidfvectorizer. I took out your `max_df` parameter). You can learn more about `max_df` and `min_df` in the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). – l_l_l_l_l_l_l_l Apr 06 '20 at 00:11
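
For reference, a minimal sketch (illustrative, not from the thread) of why that error appears: with a float max_df, the cutoff is max_df * n_documents, and with very few documents it can drop below the default min_df=1:

from sklearn.feature_extraction.text import TfidfVectorizer

# With a single document, max_df=0.7 allows terms appearing in at most
# 0.7 documents, which is less than the default min_df=1, so scikit-learn
# raises "ValueError: max_df corresponds to < documents than min_df"
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
vectorizer.fit_transform(["Trade Summit Proceeds As Planned"])

Dropping max_df, or fitting on enough documents, avoids the error.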