
I am building a sentiment analysis tool using the Google Prediction API. I have some labeled training data which I will use to train the model. Since this data is collected from social media, many of the words in the sentences are stop words, and I would like to remove them before training the model. Will that help improve accuracy? Is there any library in Java I can use to remove these stop words instead of building my own set of stop words?

Regards Deepesh

Deepesh Shetty

3 Answers


Stop words will help, but I'm afraid you will need to come up with your own list specifically tailored to sentiment analysis (i.e., not an off-the-shelf list). Here are some more ideas that might give you a boost in prediction accuracy without requiring a tremendous amount of work on your own stopword list (ideas taken from our submission to the CrowdFlower OpenData competition on Kaggle):

  • Stopwords: remove stopwords like 'RT', '@', '#', 'link', 'google', 'facebook', 'yahoo'
  • Character repetitions: collapse repeated runs of characters in a word (e.g. "hottttt" becomes "hot")
  • Spelling correction: correct spelling based on Levenshtein distance against a given corpus
  • Emoticons: make sure emoticons are not removed or ignored in your data-cleansing step (not sure how the Google Prediction API handles this)

For more ideas, also take a look at this forum thread.
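The first two cleaning steps above can be sketched in a few lines of Java. This is only an illustrative sketch: the class and method names are made up, and the stopword list is the small example set from the bullet points, which you would extend for your own domain.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class TweetCleaner {
    // Illustrative social-media stopword list; extend it for your domain.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "rt", "@", "#", "link", "google", "facebook", "yahoo"));

    // Collapse runs of three or more identical characters to one,
    // so "hottttt" becomes "hot".
    public static String collapseRepeats(String word) {
        return word.replaceAll("(.)\\1{2,}", "$1");
    }

    // Lowercase, drop stopwords, and collapse character repetitions.
    public static String clean(String tweet) {
        return Arrays.stream(tweet.toLowerCase().split("\\s+"))
                .filter(w -> !STOPWORDS.contains(w))
                .map(TweetCleaner::collapseRepeats)
                .collect(Collectors.joining(" "));
    }
}
```

For example, `TweetCleaner.clean("RT this is hottttt")` yields `"this is hot"`.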

Matt

Unless your sentiment analysis is in an area that is well defined and well researched, with a large corpus and large defined training sets (e.g., movie reviews), I'd suggest that you build your own training data. This is even more true when working with social media data (especially Twitter). Depending on your area of research/analysis, building your own training dataset lets you spend your time on a domain-specific dataset rather than trying to adapt a non-domain one.

I'd second Matt's response re: his suggestions. I'd also add that you should remove URLs and usernames from your data and treat them as 'stopwords'.
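Stripping URLs and usernames can be done with two regular expressions. A minimal sketch, with a hypothetical class name and deliberately simple patterns (real tweet URLs and handles have more edge cases):

```java
public class SocialCleaner {
    // Remove URLs and @usernames, then normalize whitespace.
    // The patterns are intentionally simple; tighten them for production use.
    public static String stripUrlsAndMentions(String text) {
        return text
                .replaceAll("https?://\\S+", " ") // URLs
                .replaceAll("@\\w+", " ")         // @usernames
                .replaceAll("\\s+", " ")          // collapse leftover whitespace
                .trim();
    }
}
```

For example, `stripUrlsAndMentions("check http://t.co/abc via @bob now")` yields `"check via now"`.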

Eric D. Brown D.Sc.

That depends on how Google Prediction's algorithm works. I'm not familiar with it, but from reading the docs it appears they do not consider word association. That is to say, they do not consider which word a sentiment-laden stop word like "not" is actually modifying.

For example,

"Cake is not close to being as good as french fries!"
"French fries are not cake, but are not bad."

In the above sentences, treating them as a "bag of words" (a sentence model in which word order does not matter) doesn't yield us much insight.
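To make the "bag of words" point concrete, here is a sketch of how such a model sees a sentence: only token counts survive, and word order is discarded entirely, so any reordering of the same words produces an identical representation. The class name is made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

public class BagOfWords {
    // Build a bag of words: a map from token to count.
    // Word order is discarded, so reordered sentences compare equal.
    public static Map<String, Integer> bag(String sentence) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : sentence.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Here `bag("the cake is not good")` and `bag("good the cake is not")` are equal, even though only the first reads as a negative statement about cake.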

My recommendation is to experiment and let your data results be your guide.

I suspect removing stop words will not make much of a difference. They should fall below the "noise" threshold of Google's matching algorithm, assuming I'm correctly divining how it works.

You can google up a list of stop words for several languages. You can also pull in one of many natural language processing libraries. Stemming words might help: try googling for "Porter stemming" or "Snowball stemming" and Java. Lucene/Solr uses this sort of analysis to build its search indexes.
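To show what stemming does, here is a deliberately crude suffix stripper. It is not the Porter or Snowball algorithm, both of which apply many more rules and exceptions (Lucene ships real implementations); this sketch, with made-up names and arbitrary length guards, only illustrates the idea of mapping inflected forms to a common stem.

```java
public class SimpleStemmer {
    // Crude suffix stripping for illustration only. A real Porter/Snowball
    // stemmer handles far more rules and irregular forms.
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies") && w.length() > 4) return w.substring(0, w.length() - 3) + "y";
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("ed") && w.length() > 4) return w.substring(0, w.length() - 2);
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }
}
```

For example, `stem("fries")` yields `"fry"` and `stem("walking")` yields `"walk"`, so "fries" and "fry" count as the same feature.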

Good luck.

Sam