I am just doing some research into NLP with Python and I have identified something strange.

On review of the following negative tweets:

neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),  # <---
              ('He is my enemy', 'negative')]

After some processing to remove stop words:

from nltk.corpus import stopwords

clean_data = []
stop_words = set(stopwords.words("english"))

for (words, sentiment) in pos_tweets + neg_tweets:
    # The stop-word check runs before lowercasing, so 'I' survives as 'i'
    words_filtered = [e.lower() for e in words.split() if e not in stop_words]
    clean_data.append((words_filtered, sentiment))

Part of the output is:

 (['i', 'looking', 'forward', 'concert'], 'negative')

I'm struggling to understand why the stop words include 'not', which can change the sentiment of a tweet.

My understanding is that stop words have no value in terms of sentiment.

So my question is: why is 'not' included in the stop words list?

Andrew Daly
  • https://datascience.stackexchange.com/questions/15765/nlp-why-is-not-a-stop-word –  Jun 27 '17 at 15:53
  • refer to https://stats.stackexchange.com/questions/205078/latent-semantic-analysis-stop-words-and-link-words – danche Jun 27 '17 at 15:55
  • Mainly because they're most typically used in search and retrieval. Which isn't your use-case. – pvg Jun 27 '17 at 15:57
  • I don't know why, but I think you can do something like: take_out_not = {'not'}; stop_words = set(stopwords.words("english")) - take_out_not – mikeY Jun 27 '17 at 15:57
  • The stopwords list is not specifically designed for sentiment analysis. Before you do stopword removal, customize what you want to remove. E.g., you can manually remove negation words from the stopwords list. – alexis Jun 27 '17 at 23:47
  • That's the plan going forward, I will manually remove the negation words from my list, thanks! – Andrew Daly Jun 28 '17 at 09:13
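
The suggestion in the comments above (drop the negation words from the stop word list before filtering) could be sketched roughly as follows. To keep the example self-contained, `STOP_WORDS` is a small illustrative subset rather than NLTK's real list, and `NEGATIONS` and `remove_stop_words` are hypothetical names chosen for this sketch; in practice you would probably pass `set(stopwords.words("english"))` instead.

```python
# Illustrative subset only (an assumption for this sketch); in practice this
# would likely be set(stopwords.words("english")) from NLTK.
STOP_WORDS = {"i", "am", "not", "to", "the", "is", "this", "do", "my", "he"}

# Negation words to protect from removal. This list is an assumption;
# extend it to suit your data.
NEGATIONS = {"not", "no", "nor"}


def remove_stop_words(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    """Lowercase and split the text, then drop stop words except those in `keep`."""
    effective = set(stop_words) - set(keep)
    # Lowercase before the membership test so that 'I' matches 'i'.
    return [w for w in text.lower().split() if w not in effective]


tokens = remove_stop_words("I am not looking forward to the concert")
print(tokens)  # ['not', 'looking', 'forward', 'concert']
```

Note that lowercasing happens before the stop word check here, so 'i' is removed as well, unlike in the output shown in the question.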

1 Answer

Stop words are generally of little or no use in a sentence. As the Stanford NLP group puts it:

Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words

Why the word "not"? Simply because it appears very often in English text and is usually of little or no importance. For example, in text summarization these stop words contribute little, and the result is determined largely by the frequency distribution of words (as with tf-idf).

So what can you do? This is a broad topic known as negation handling, and there are many different methods. One of my favorites is to simply attach each negation word to the word that precedes or follows it, before removing the stop words or computing word vectors. For example, you can convert "not looking" to "not_looking", which will land somewhere quite different from "looking" once converted to vector space. You can find code for doing something similar in an SO answer here.
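
As a minimal sketch of that idea (not the code from the linked answer), the snippet below joins each negation word to the token that follows it and only then removes stop words. The names `mark_negations` and `preprocess`, the negation list, and the tiny stop word set are all assumptions made for illustration; a real pipeline would likely use NLTK's stop word list and a proper tokenizer.

```python
# Assumed negation words and a tiny stand-in stop word list (both illustrative).
NEGATIONS = {"not", "no", "never"}
STOP_WORDS = {"i", "am", "not", "to", "the", "is", "this", "do", "my", "he"}


def mark_negations(tokens, negations=NEGATIONS):
    """Merge each negation word with the token that follows it, e.g. not_looking."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in negations and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the token that was just merged
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, split, merge negations, then remove stop words."""
    tokens = mark_negations(text.lower().split())
    return [t for t in tokens if t not in stop_words]


print(preprocess("I am not looking forward to the concert"))
# ['not_looking', 'forward', 'concert']
```

Because the merge happens before stop word removal, "not_looking" is a new token that is not in the stop word list, so the negation survives filtering.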

I hope this helps!

Rudresh Panchal