Twitter sentiment analysis on a string

Question

I've written a program that takes Twitter data containing tweets and labels (0 for neutral sentiment and 1 for negative sentiment) and predicts which category a tweet belongs to. The program works well on the training and test sets. However, I'm having trouble applying the predict function to a single string, and I'm not sure how to do that.

I have tried cleaning the string the same way I cleaned the dataset before calling the predict function, but the values returned are in the wrong shape.

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import re

#Loading dataset
dataset = pd.read_csv('tweet.csv')

#List to hold cleaned tweets
clean_tweet = []

#Cleaning tweets
for i in range(len(dataset)):
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = re.sub('@[\w]*',' ',dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    tweet = [ps.stem(token) for token in tweet if not token in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    clean_tweet.append(tweet)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 3000)
X = cv.fit_transform(clean_tweet)
X =  X.toarray()
y = dataset.iloc[:, 1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.naive_bayes import GaussianNB
n_b = GaussianNB()
n_b.fit(X_train, y_train)
y_pred  = n_b.predict(X_test) 

some_tweet = "this is a mean tweet"  # How to apply predict function to this string

Robin · Accepted Answer · 2019-07-03T14:50:25.307

2

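Use cv.transform([cleaned_new_tweet]) on your new string to map the new tweet onto the vocabulary of your existing document-term matrix. That returns the tweet as a single row with the same columns as your training data, which is the shape predict expects. Since you trained on a dense array (X.toarray()), call .toarray() on the result as well before passing it to n_b.predict.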
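A minimal sketch of that call on a toy corpus. The three training tweets and their labels are made up for illustration; `cv` and `n_b` mirror the names in the question, and the new string is assumed to be cleaned already:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy stand-ins for the cleaned training tweets and labels (hypothetical data)
clean_tweets = ["love sunni day", "hate mean peopl", "mean rude tweet"]
labels = [0, 1, 1]

cv = CountVectorizer(max_features=3000)
X = cv.fit_transform(clean_tweets).toarray()  # the vocabulary is learned here, once

n_b = GaussianNB()
n_b.fit(X, labels)

# Clean the new string the same way as the training data before this step
cleaned_new_tweet = "mean tweet"

# Note the list around the string, and the same densifying step as for X
new_X = cv.transform([cleaned_new_tweet]).toarray()

prediction = n_b.predict(new_X)
print(new_X.shape, prediction)
```

The row has one column per word in the fitted vocabulary, so it lines up with the matrix the classifier was trained on.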

edited Jul 03 '19 at 14:50

answered Jul 03 '19 at 14:33


`cv.transform()` on my new string gives me an error - `ValueError: Iterable over raw text documents expected, string object received.` – Imanpal Singh Jul 03 '19 at 14:46
Sorry, cv.transform() takes an iterable, so you will need to make new_tweet part of an iterable (for example, a list). I've updated the answer and that should work. – Robin Jul 03 '19 at 14:50
Thank you, it worked. However can you tell me why `cv.fit_transform()` would be wrong here ? – Imanpal Singh Jul 03 '19 at 14:56
https://stackoverflow.com/questions/38692520/what-is-the-difference-between-fit-transform-and-transform-in-sklearn-countvecto this should point you in the right direction. – Robin Jul 03 '19 at 15:05
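To illustrate the difference asked about in the comments, here is a small sketch with made-up documents. A second vectorizer (`refit`, a name introduced only for this example) is used for the `fit_transform` call so the original `cv` stays untouched:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["love sunni day", "hate mean peopl", "mean rude tweet"]  # hypothetical data

cv = CountVectorizer()
X_train = cv.fit_transform(train_docs)  # learns the vocabulary from the training text

# transform() reuses the learned vocabulary: same columns as the training matrix
new_ok = cv.transform(["mean tweet"])

# fit_transform() relearns the vocabulary from the new text alone,
# so the columns no longer match what the classifier was trained on
refit = CountVectorizer()
new_bad = refit.fit_transform(["mean tweet"])

print(X_train.shape, new_ok.shape, new_bad.shape)
```

Calling `fit_transform` on the new tweet would therefore produce a row whose width depends only on that tweet, which is the wrong shape for a model trained on the original columns.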

score 2 · Answer 2 · answered Jul 03 '19 at 15:06

tl;dr

The pipeline's .predict() expects an iterable of strings (one per document), not a bare string. So you need to put some_tweet in a list, e.g. new_tweet = ["this is a mean tweet"].

Your code

You had some issues in your code that I tried fixing for you...

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import re

#Loading dataset
dataset = pd.read_csv('tweet.csv')


# Define the cleaning function once so it can easily be re-used elsewhere
stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per token

def clean_tweet(tweet: str) -> str:
    # BUG FIX: work on the `tweet` argument, not on dataset['tweet'][i],
    # and keep passing the modified tweet to the next step
    tweet = re.sub(r'@\w*', ' ', tweet)       # remove @mentions first, while the '@' is still present
    tweet = re.sub(r'[^a-zA-Z]', ' ', tweet)  # then replace everything that is not a letter
    tweet = tweet.lower()
    tokens = tweet.split()
    tokens = [ps.stem(token) for token in tokens if token not in stop_words]
    return ' '.join(tokens)

#List to hold cleaned tweets and labels
X = [clean_tweet(tweet) for tweet in dataset['tweet']] # you can create your X directly with your new function
y = dataset.iloc[:, 1].values

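# Define a single model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Use a Pipeline as your classifier; this way you don't need to keep calling transform and fit yourself.
classifier = Pipeline(
    [
        ('cv', CountVectorizer(max_features=300)),
        # GaussianNB needs a dense array, but CountVectorizer returns a sparse matrix
        ('to_dense', FunctionTransformer(lambda m: m.toarray(), accept_sparse=True)),
        ('n_b', GaussianNB())
    ]
)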


# Previously you fitted your CountVectorizer BEFORE splitting into train/test. That is a big mistake,
# because the vocabulary is then learned from the test tweets as well.
# First split into train/test, then fit all the steps of your model on the training part only.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Here you train all steps of your Pipeline in one go.
classifier.fit(X_train, y_train)
y_pred  = classifier.predict(X_test)


# Predicting new tweets
some_tweet = "this is a mean tweet"
some_tweet = clean_tweet(some_tweet) # re-use your clean function
predicted = classifier.predict([some_tweet]) # put the tweet inside a list!!!!
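One detail worth checking in the cleaning step is the order of the two substitutions: if non-letters are stripped first, the `@` is already gone by the time the mention pattern runs, so usernames survive as ordinary words. A small sketch using only the standard library, with a made-up tweet:

```python
import re

raw = "@JohnDoe you are so MEAN!!"  # made-up tweet

# Non-letters first: the '@' is replaced by a space, so the mention
# pattern on the next line has nothing left to match
letters_first = re.sub(r'[^a-zA-Z]', ' ', raw)
letters_first = re.sub(r'@\w*', ' ', letters_first)

# Mentions first: the whole '@JohnDoe' token is removed before the
# remaining punctuation is stripped
mentions_first = re.sub(r'@\w*', ' ', raw)
mentions_first = re.sub(r'[^a-zA-Z]', ' ', mentions_first)

print(letters_first.lower().split())   # username kept as a token
print(mentions_first.lower().split())  # username removed
```

Whichever order is chosen, the same function has to be applied to the training tweets and to any new string, otherwise the tokens will not line up with the learned vocabulary.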

Thank you so much for this. Can you share a good resource on the Pipeline that you mentioned? I am new to this and haven't learned that yet. Also, does splitting improve performance, or does it just help later in calculating various scores and accuracies? — Imanpal Singh, Jul 03 '19 at 16:56
[this](https://datascience.stackexchange.com/questions/33008/is-it-always-better-to-use-the-whole-dataset-to-train-the-final-model) goes into a good amount of detail on training/test splits. — Robin, Jul 04 '19 at 11:54
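On the splitting question, a rough sketch of what the split is for: it does not make the model better by itself, it sets aside tweets the model never saw so that the score reflects unseen data. The eight tweets and labels below are invented, and `MultinomialNB` is swapped in for `GaussianNB` only because it accepts the sparse matrix from `CountVectorizer` directly, which keeps the example short:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical cleaned tweets: 0 = neutral, 1 = negative
texts = [
    "love sunni day", "great friend today", "nice calm morn", "happi good news",
    "hate mean peopl", "rude nasti tweet", "awful mean comment", "stupid ugli post",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split first, so nothing from the test tweets leaks into the fitted steps
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

pipeline = Pipeline([("cv", CountVectorizer()), ("nb", MultinomialNB())])
pipeline.fit(X_train, y_train)  # vocabulary and classifier both see training data only

test_accuracy = accuracy_score(y_test, pipeline.predict(X_test))

# The learned vocabulary contains training words only
learned_vocab = set(pipeline.named_steps["cv"].vocabulary_)
train_words = set(" ".join(X_train).split())

print(test_accuracy, learned_vocab <= train_words)
```

With a dataset this small the accuracy value itself means little; the point is only the order of operations.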
