
I have been wrangling with R to classify tweets using a Naive Bayes classifier model.

Data:

Training set with 2 columns: Tweet and class. There are 300 tweets: 150 classified as "App" and 150 classified as "Other".

Objective:

Test set with 20 data points (tweets): the first 10 are "App" and the last 10 are "Other". I want to predict these classes. I can successfully build a Naive Bayes model in Excel (blekh) and predict 19 of the 20 correctly.

I want to replicate this with R.

Code snippet

library(tm)
library('e1071')

# Custom function: lowercase the text and replace common punctuation with spaces
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]"," ",x)
  x <- gsub("[:]+[ ]"," ",x)
  x <- gsub("[?]"," ",x)
  x <- gsub("[!]"," ",x)
  x <- gsub("[;]"," ",x)
  x <- gsub("[,]"," ",x)
  x
}

# Process text - tolower(), remove punctuation etc. 
tweets.all$Tweet <- replacePunctuation(tweets.all$Tweet)
tweets.test$Tweet <- replacePunctuation(tweets.test$Tweet)

# Create a corpus for training and testing data set
tweets.train.corpus <- Corpus(VectorSource(as.vector(tweets.all$Tweet)))
tweets.test.corpus <- Corpus(VectorSource(as.vector(tweets.test$Tweet)))

# Create term document matrix but only keep words of length 4 or above
tweets.train.matrix <- t(TermDocumentMatrix(tweets.train.corpus, control = list(wordLengths = c(4, Inf))))
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus, control = list(wordLengths = c(4, Inf))))

# Build model with additive smoothing as 1
model <- naiveBayes(as.matrix(tweets.train.matrix), as.factor(tweets.all$class), laplace = 1)

# Predict
results <- predict(object = model, newdata = as.matrix(tweets.test.matrix))
results
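
A possible culprit, offered as a hedged sketch rather than a confirmed fix: the two TermDocumentMatrix() calls build their vocabularies independently, so the columns of the test matrix need not match the training columns. In addition, e1071's naiveBayes() models numeric columns as Gaussian, and its laplace argument only affects categorical predictors. The base-R sketch below uses made-up count matrices in place of the tm output; alignToVocab() and toPresence() are hypothetical helpers, not part of tm or e1071. It rebuilds the test matrix on the training vocabulary and recodes counts as "No"/"Yes" factors.

```r
# Toy count matrices standing in for the tm output (hypothetical data).
train.counts <- matrix(c(1, 0, 2,
                         0, 1, 0),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(NULL, c("mandrill", "monkey", "email")))
test.counts <- matrix(c(1, 3,
                        0, 1),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(NULL, c("email", "smtp")))

# Hypothetical helper: rebuild a count matrix on a fixed vocabulary.
# Terms missing from `counts` become zero columns; terms not in `vocab` are dropped.
alignToVocab <- function(counts, vocab) {
  aligned <- matrix(0, nrow = nrow(counts), ncol = length(vocab),
                    dimnames = list(NULL, vocab))
  shared <- intersect(colnames(counts), vocab)
  aligned[, shared] <- counts[, shared, drop = FALSE]
  aligned
}

# Hypothetical helper: recode counts as a two-level factor per term,
# so that every column is categorical rather than numeric.
toPresence <- function(counts) {
  df <- as.data.frame(counts > 0)
  df[] <- lapply(df, factor, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  df
}

train.features <- toPresence(train.counts)
test.features  <- toPresence(alignToVocab(test.counts, colnames(train.counts)))
```

Data frames shaped like train.features and test.features are the kind of input one could then pass to naiveBayes() and predict(); whether that recovers the Excel result on the real tweets is untested here. tm's dictionary control option may be a more direct way to pin the test vocabulary to the training terms.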

Data Sample

A call to head(tweets.all) yields:

 Tweet class
 1                            [blog] Using Nullmailer and Mandrill for your Ubuntu Linux server outboud mail:  https://opensourcehacker.com/2013/03/25/using-nullmailer-and-mandrill-for-your-ubuntu-linux-server-outboud-mail/?utm_source=twitterfeed&utm_medium=twitter  #plone   App
 2                     [blog] Using Postfix and free Mandrill email service for SMTP on Ubuntu Linux server:  https://opensourcehacker.com/2013/03/26/using-postfix-and-free-mandrill-email-service-for-smtp-on-ubuntu-linux-server/?utm_source=twitterfeed&utm_medium=twitter  #plone   App
 3 @aalbertson There are several reasons emails go to spam. Mind submitting a request at http://help.mandrill.com  with additional details?   App
 4                    @adrienneleigh I just switched it over to Mandrill, let's see if that improve the speed at which the emails are sent.   App
 5      @ankeshk +1 to @mailchimp We use MailChimp for marketing emails and their Mandrill app for txn emails... @sampad @abhijeetmk @hiway   App
 6 @biggoldring That error may occur if unsupported auth method used. Can you email us via http://help.mandrill.com  so we can get details?   App

A call to head(tweets.test) yields:

Tweet
1   Just love @mandrillapp transactional email service - http://mandrill.com Sorry @SendGrid and @mailjet #timetomoveon
2   @rossdeane Mind submitting a request at http://help.mandrill.com with account details if you haven't already? Glad to take a look!
3   @veroapp Any chance you'll be adding Mandrill support to Vero?
4   @Elie__ @camj59 jparle de relai SMTP!1 million de mail chez mandrill / mois comparé à 1 million sur lite sendgrid y a pas photo avec mailjet
5   would like to send emails for welcome, password resets, payment notifications, etc. what should i use? was looking at mailgun/mandrill
6   From Coworker about using Mandrill:  "I would entrust email handling to a Pokemon".

Output

This is what I get:

 [1] Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other
 Levels: App Other

Which is rubbish: every test tweet is classified as "Other", even though the first 10 should be "App". Any idea what I'm doing wrong?
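
For comparison, here is a minimal base-R sketch of a Naive Bayes calculation done by hand on word counts. It assumes a multinomial model with add-one smoothing, which may or may not match the Excel formulas exactly; the toy data and the predictManualNB() helper are hypothetical and only illustrate the arithmetic.

```r
# Hypothetical toy data: rows are tweets, columns are term counts.
train.counts <- matrix(c(2, 1, 0, 0,
                         1, 2, 0, 0,
                         0, 0, 1, 2,
                         0, 1, 2, 1),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(NULL, c("email", "smtp", "monkey", "baboon")))
train.class <- factor(c("App", "App", "Other", "Other"))

# Total count of each term within each class (classes x terms).
term.totals <- rowsum(train.counts, group = train.class)

# Add-one (Laplace) smoothing, then normalise each class row to probabilities.
smoothed <- term.totals + 1
log.likelihood <- log(smoothed / rowSums(smoothed))

# Log prior from the class frequencies in the training set.
log.prior <- log(table(train.class) / length(train.class))

# Score a count matrix: log prior + sum over terms of count * log P(term | class),
# then pick the class with the highest score for each row.
predictManualNB <- function(counts) {
  scores <- counts %*% t(log.likelihood)
  scores <- sweep(scores, 2, as.numeric(log.prior[colnames(scores)]), "+")
  factor(colnames(scores)[max.col(scores, ties.method = "first")],
         levels = levels(train.class))
}

test.counts <- matrix(c(1, 1, 0, 0,
                        0, 0, 1, 1),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(NULL, colnames(train.counts)))
predicted <- predictManualNB(test.counts)
```

The test matrix here is built on the same columns as the training matrix, which the scoring step relies on.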

epo3
salemmarafi
  • Why should we agree with you that it "is rubbish"? There is the glaring defect in this question that it has no data, and the further defect that these functions are not base R and you name no package(s). – IRTFM May 03 '14 at 16:17
  • Hi @BondedDust, Thanks for the comment! The rubbish refers to the classification quality - the first 10 in test are "app" and the last 10 are "other" and so we would expect some variation of "App" and "Other" but the code above does not give that. I updated the question with sample data and the libraries / functions being used. Hope this helps, I'm a bit lost on how to move forward. Thanks :) – salemmarafi May 03 '14 at 21:20
  • You need more data. Text classification is a high dimensional problem. – Spaceghost May 04 '14 at 23:12
  • @Spaceghost thanks. I'm able to classify the tweets without using the naiveBayes() function from e1071 just by doing the calculations manually - see: [link](http://www.salemmarafi.com/code/twitter-naive-bayes/#r-code) so perhaps I'm doing something wrong with the naiveBayes() function? – salemmarafi May 05 '14 at 09:19
  • @salemmarafi did you manage to get past your problem? – John x Dec 11 '14 at 16:53
  • @Johnx no unfortunately not but you can see the approach I took in the link. – salemmarafi Dec 13 '14 at 18:29
  • @salemmarafi tnx for your time – John x Dec 14 '14 at 10:29

0 Answers