3

I have a dataset that contains just two useful columns for training my model: the first is the news headline and the second is the news category.

So, I got the following training code running successfully in Python:

import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder


# grab the data
news = pd.read_csv("/Users/helloworld/Downloads/NewsAggregatorDataset/newsCorpora.csv",encoding='latin-1')
news.head()

def normalize_text(s):
    s = s.lower()

    # strip non-word characters that border whitespace
    # (word-internal punctuation such as hyphens and apostrophes is kept)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)

    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)

    return s

news['TEXT'] = [normalize_text(s) for s in news['TITLE']]

# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])

encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

nb = MultinomialNB()
nb.fit(x_train, y_train)

So my question is: how can I give the program a new set of data (e.g. just a news headline) and have it predict the news category using scikit-learn?

P.S. My training data looks like this:

[screenshot of the training data]

  • Have you tried using the `predict` method that's part of the `MultinomialNB` class? http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html. You already trained it on the titles, with the category as the output. To use Naive Bayes on new data, apply the same feature transformation you used for training and then feed the result to the classifier (a minimal sketch follows these comments). – rayryeng Nov 22 '17 at 15:33
  • Why don't you just use `y_predicted = nb.predict(x_test)`? – seralouk Nov 22 '17 at 16:04
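
A minimal sketch of the workflow the first comment describes, assuming the `normalize_text` function and the fitted `vectorizer`, `encoder`, and `nb` objects from the question are still in scope (the headlines below are made-up examples):

new_titles = [
    "stocks rally as markets close higher",
    "new smartphone model announced at tech expo",
]

# apply the same normalization and vectorization used at training time
new_texts = [normalize_text(t) for t in new_titles]
x_new = vectorizer.transform(new_texts)   # transform, NOT fit_transform

# predict encoded categories, then map them back to the original labels
y_new = nb.predict(x_new)
print(encoder.inverse_transform(y_new))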

2 Answers

4

You should train the model on the training data (as you did) and then predict on new data (here, the test data).


Do the following:

nb = MultinomialNB()
nb.fit(x_train, y_train)

y_predicted = nb.predict(x_test)

Now, if you want to evaluate the predictions based on their **accuracy**, you can do the following:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predicted) 

Similarly, you can calculate other metrics.

Finally, you can see all the available metrics in the `sklearn.metrics` documentation.
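
For example, a per-class report and a confusion matrix could be printed like this (a sketch, assuming the `y_test`, `y_predicted` and `encoder` objects from above, and that every category appears in the test split):

from sklearn.metrics import classification_report, confusion_matrix

# per-class precision/recall/F1, with readable label names
print(classification_report(y_test, y_predicted, target_names=encoder.classes_))

# rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_predicted))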


EDIT 1

When you type:

 y_predicted = nb.predict(x_test)

y_predicted will contain numerical values that correspond to your categories.

To map these values back to the original labels, you can do:

y_predicted_labels = encoder.inverse_transform(y_predicted) 
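
If you just want to inspect which number corresponds to which category, the fitted encoder exposes the mapping (a small sketch):

# classes_ is ordered so that encoded value i corresponds to classes_[i]
print(encoder.classes_)                   # e.g. ['b' 'e' 'm' 't']
print(dict(enumerate(encoder.classes_)))  # e.g. {0: 'b', 1: 'e', 2: 'm', 3: 't'}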
seralouk
  • Thanks for the tip, serafeim. I tried `y_predicted = nb.predict(x_test)`, but it returns an array of numbers like 2, 1, 3, ... What do they mean? My categories are like 'b', 'a' or 'c' in my training dataset. – userIndulgeInDChord Nov 24 '17 at 10:29
  • 1, 2, 3 correspond to a, b, c – seralouk Nov 24 '17 at 12:32
  • serafeim, my categories are 'b', 't', 'e' and 'm'. How can I transform the numbers in the array back to these categories? Your help is highly appreciated. :-) – userIndulgeInDChord Nov 24 '17 at 15:35
  • @userIndulgeInDChord I edited my answer to explain how to get the actual labels; please consider accepting it. You used LabelEncoder to encode the y variable, and in the same way you can get the actual labels back. – seralouk Nov 24 '17 at 17:33
  • Thanks a million, serafeim. You saved me days! ;-) – userIndulgeInDChord Nov 25 '17 at 05:34
  • @serafeim so are predictions limited to the starting dataset? How do I add a different dataset to predict on, based on what has been learned from a training set? – Vaidøtas I. Jul 30 '19 at 07:20
  • @VaidøtasIvøška for a new external dataset you can use `y_predicted = nb.predict(x_external)` after having trained the `nb` model using some different training data. – seralouk Jul 30 '19 at 09:58
  • @serafeim does the new external dataset need to have the same structure as well as the target column data? – Vaidøtas I. Jul 30 '19 at 10:28
  • It only needs to have the exact same number of features/variables. No need for a target variable; only the X data is used for prediction. – seralouk Jul 30 '19 at 10:58
  • @serafeim See, that's what I don't get: I try to use only the feature columns and get a shape-mismatch error, "operands could not be broadcast together with shapes". How can I train with 10 columns (9 features, 1 target) and then give 9 features to the fitted model to predict the 1 target I do not know? – Vaidøtas I. Jul 30 '19 at 16:04
  • If the initial model is trained on 9 features and the new set also has 9 features, this error should not occur. You can create a new question and point me to it so I can provide an answer. – seralouk Jul 30 '19 at 17:36
  • @seralouk What if I want to provide only a title and expect the saved Naive Bayes model to predict its category? Here, I have no record of the vectorizer's vocabulary. – Tanmay Bairagi Dec 29 '20 at 10:58
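
Regarding the last comment: one option is to persist the fitted vectorizer and encoder alongside the model, for example with `joblib`, so a raw title can be scored later (a sketch under that assumption; the file names are made up):

import joblib

# save everything needed to score a raw title later
joblib.dump(vectorizer, "vectorizer.joblib")
joblib.dump(encoder, "encoder.joblib")
joblib.dump(nb, "nb_model.joblib")

# ... later, in another session (normalize_text must be defined/imported there too) ...
vectorizer = joblib.load("vectorizer.joblib")
encoder = joblib.load("encoder.joblib")
nb = joblib.load("nb_model.joblib")

title = "some new headline to classify"                 # made-up input
x_one = vectorizer.transform([normalize_text(title)])
print(encoder.inverse_transform(nb.predict(x_one))[0])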
1

You are very close; you just need two more lines of code. This link explains Naive Bayes with scikit-learn: https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn

The short answer to your question is below: import the accuracy function,

from sklearn.metrics import accuracy_score

test the model using the `predict` function,

preds = nb.predict(x_test)

and then check the accuracy:

print(accuracy_score(y_test, preds))
Alex Jacob
  • Thanks for your tips, Ajith. The loaded dataset was actually split into training and test sets. What I would like is to load a new dataset and have the program predict the news category. – userIndulgeInDChord Nov 22 '17 at 16:01
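
A minimal sketch of that idea, assuming a hypothetical new CSV that has only a TITLE column, and reusing `normalize_text` and the fitted `vectorizer`, `encoder` and `nb` from the question:

import pandas as pd

# load a new file of headlines (path and column name are hypothetical)
new_news = pd.read_csv("new_headlines.csv")

# same preprocessing as at training time, then predict and decode the labels
x_new = vectorizer.transform(new_news['TITLE'].apply(normalize_text))
new_news['PREDICTED_CATEGORY'] = encoder.inverse_transform(nb.predict(x_new))
print(new_news.head())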