I have been wrangling with R to classify tweets using a Naive Bayes classifier model.
Data:
A training set with 2 columns: Tweet and class. There are 300 tweets: 150 labelled "App" and 150 labelled "Other".
Objective:
A test set with 20 data points (tweets): the first 10 are "App" and the last 10 are "Other". I want to predict these labels. I can successfully build a Naive Bayes model in Excel (blekh) that predicts 19 of the 20 correctly.
I want to replicate this with R.
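For reference, the kind of hand calculation a spreadsheet can do might look like the sketch below. It assumes a multinomial Naive Bayes over word counts with add-one smoothing, which may not match the Excel model exactly. The toy tweets and the helper names `tokenize` and `score_tweet` are made up for illustration and are not my real data.

```r
# Toy training data (made up for illustration, not the real tweets)
train <- data.frame(
  Tweet = c("mandrill email service down",
            "send email with mandrill smtp",
            "mandrill monkey at the zoo",
            "spacechem mandrill level guide"),
  class = c("App", "App", "Other", "Other"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, split on non-alphanumerics,
# keep words of 4 or more characters
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z0-9]+"))
  words[nchar(words) >= 4]
}

vocab   <- unique(unlist(lapply(train$Tweet, tokenize)))
classes <- unique(train$class)

# Word counts per class: one row per vocabulary word, one column per class
counts <- vapply(classes, function(cl) {
  words <- unlist(lapply(train$Tweet[train$class == cl], tokenize))
  as.numeric(table(factor(words, levels = vocab)))
}, numeric(length(vocab)))
rownames(counts) <- vocab

# Add-one smoothed log-likelihoods and log-priors
log_lik   <- log(sweep(counts + 1, 2, colSums(counts) + length(vocab), "/"))
class_freq <- table(train$class)
log_prior <- log(as.numeric(class_freq[classes]) / nrow(train))
names(log_prior) <- classes

# Hypothetical helper: pick the class with the highest log-posterior,
# ignoring words that never appeared in training
score_tweet <- function(tweet) {
  words <- tokenize(tweet)
  words <- words[words %in% vocab]
  scores <- log_prior + colSums(log_lik[words, , drop = FALSE])
  names(which.max(scores))
}

score_tweet("mandrill smtp email outage")
score_tweet("mandrill monkey guide")
```

The point of the sketch is only that every word contributes a per-class log-probability learned from counts, so the result depends on the test words being looked up in the training vocabulary.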
Code snippet
library(tm)
library('e1071')
# Custom Function
replacePunctuation <- function(x) {
  x <- tolower(x)
  x <- gsub("[.]+[ ]", " ", x)
  x <- gsub("[:]+[ ]", " ", x)
  x <- gsub("[?]", " ", x)
  x <- gsub("[!]", " ", x)
  x <- gsub("[;]", " ", x)
  x <- gsub("[,]", " ", x)
  x
}
# Process text - tolower(), remove punctuation etc.
tweets.all$Tweet <- replacePunctuation(tweets.all$Tweet)
tweets.test$Tweet <- replacePunctuation(tweets.test$Tweet)
# Create a corpus for training and testing data set
tweets.train.corpus <- Corpus(VectorSource(as.vector(tweets.all$Tweet)))
tweets.test.corpus <- Corpus(VectorSource(as.vector(tweets.test$Tweet)))
# Create term-document matrices (transposed to documents x terms), keeping only words of length 4 or above
tweets.train.matrix <- t(TermDocumentMatrix(tweets.train.corpus, control = list(wordLengths = c(4, Inf))))
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus, control = list(wordLengths = c(4, Inf))))
# Build model with additive smoothing as 1
model <- naiveBayes(as.matrix(tweets.train.matrix), as.factor(tweets.all$class), laplace = 1)
# Predict
results <- predict(object = model, newdata = as.matrix(tweets.test.matrix))
results
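One sanity check I can think of for the two matrices above: each is built from its own corpus, so their columns (terms) need not line up. Below is a minimal base-R sketch with made-up toy matrices standing in for `as.matrix(tweets.train.matrix)` and `as.matrix(tweets.test.matrix)`. `align_to_training` is a hypothetical helper written for this example, not a function from tm or e1071, and I am not sure this is the actual cause of my problem.

```r
# Toy document-term matrices (made up); columns are terms, rows are tweets
train_m <- matrix(c(1, 0, 2,
                    0, 1, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("email", "mandrill", "smtp")))
test_m <- matrix(c(1, 1,
                   0, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("mandrill", "monkey")))

# Terms that appear in only one of the two matrices
only_in_train <- setdiff(colnames(train_m), colnames(test_m))
only_in_test  <- setdiff(colnames(test_m), colnames(train_m))

# Hypothetical helper: rebuild the test matrix on the training vocabulary.
# Terms missing from the test set become zero columns; terms unseen in
# training are dropped.
align_to_training <- function(test, train) {
  aligned <- matrix(0, nrow = nrow(test), ncol = ncol(train),
                    dimnames = list(rownames(test), colnames(train)))
  shared <- intersect(colnames(test), colnames(train))
  aligned[, shared] <- test[, shared, drop = FALSE]
  aligned
}

test_aligned <- align_to_training(test_m, train_m)
```

If `only_in_train` or `only_in_test` is non-empty on the real matrices, the model and the new data are not describing the same features.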
Data Sample
A call to head(tweets.all) yields:
Tweet class
1 [blog] Using Nullmailer and Mandrill for your Ubuntu Linux server outboud mail: https://opensourcehacker.com/2013/03/25/using-nullmailer-and-mandrill-for-your-ubuntu-linux-server-outboud-mail/?utm_source=twitterfeed&utm_medium=twitter #plone App
2 [blog] Using Postfix and free Mandrill email service for SMTP on Ubuntu Linux server: https://opensourcehacker.com/2013/03/26/using-postfix-and-free-mandrill-email-service-for-smtp-on-ubuntu-linux-server/?utm_source=twitterfeed&utm_medium=twitter #plone App
3 @aalbertson There are several reasons emails go to spam. Mind submitting a request at http://help.mandrill.com with additional details? App
4 @adrienneleigh I just switched it over to Mandrill, let's see if that improve the speed at which the emails are sent. App
5 @ankeshk +1 to @mailchimp We use MailChimp for marketing emails and their Mandrill app for txn emails... @sampad @abhijeetmk @hiway App
6 @biggoldring That error may occur if unsupported auth method used. Can you email us via http://help.mandrill.com so we can get details? App
A call to head(tweets.test) yields:
Tweet
1 Just love @mandrillapp transactional email service - http://mandrill.com Sorry @SendGrid and @mailjet #timetomoveon
2 @rossdeane Mind submitting a request at http://help.mandrill.com with account details if you haven't already? Glad to take a look!
3 @veroapp Any chance you'll be adding Mandrill support to Vero?
4 @Elie__ @camj59 jparle de relai SMTP!1 million de mail chez mandrill / mois comparé à 1 million sur lite sendgrid y a pas photo avec mailjet
5 would like to send emails for welcome, password resets, payment notifications, etc. what should i use? was looking at mailgun/mandrill
6 From Coworker about using Mandrill: "I would entrust email handling to a Pokemon".
Output
This is what I get:
[1] Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other
Levels: App Other
Which is rubbish: every tweet is predicted as "Other", so the model is not classifying correctly. Any idea what I'm doing wrong?
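To put a number on it, the printed predictions can be scored against the known test labels (first 10 "App", last 10 "Other"). This is a small base-R sketch; `expected`, `predicted`, `confusion` and `accuracy` are names introduced here, and `predicted` simply reproduces the output shown above rather than calling the model.

```r
# Known labels of the 20 test tweets: first 10 "App", last 10 "Other"
expected <- factor(rep(c("App", "Other"), each = 10), levels = c("App", "Other"))

# The predictions printed above: all 20 are "Other"
predicted <- factor(rep("Other", 20), levels = c("App", "Other"))

# Confusion matrix (rows: predicted, columns: expected) and overall accuracy
confusion <- table(predicted = predicted, expected = expected)
accuracy  <- mean(predicted == expected)

confusion
accuracy
```

On the real run, `predicted` would be `results`. An accuracy of 0.5 on a balanced two-class test set is what a constant guess gives.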