TL;DR
The demo_liu_hu_lexicon function is a demo of how you could use the opinion_lexicon. It is meant for testing / documentation and should not be used directly in your code.
In Long
Let's look at the function and see how we can re-create a similar one: https://github.com/nltk/nltk/blob/develop/nltk/sentiment/util.py#L616
def demo_liu_hu_lexicon(sentence, plot=False):
    """
    Basic example of sentiment classification using Liu and Hu opinion lexicon.
    This function simply counts the number of positive, negative and neutral words
    in the sentence and classifies it depending on which polarity is more represented.
    Words that do not appear in the lexicon are considered as neutral.

    :param sentence: a sentence whose polarity has to be classified.
    :param plot: if True, plot a visual representation of the sentence polarity.
    """
    from nltk.corpus import opinion_lexicon
    from nltk.tokenize import treebank

    tokenizer = treebank.TreebankWordTokenizer()
Okay, it's strange for the imports to live inside the function, but that's because this is a demo function used for simple testing or documentation.
Also, the usage of treebank.TreebankWordTokenizer() is rather odd; we can simply use nltk.word_tokenize instead.
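To check that this swap is safe, here's a minimal sketch (the example sentence is my own, and it assumes the punkt tokenizer models have been downloaded); for a simple one-sentence input both tokenizers should produce the same tokens, since word_tokenize wraps a Treebank-style word tokenizer:

from nltk import word_tokenize
from nltk.tokenize import treebank

sentence = "This movie was surprisingly good, not bad at all!"

# Both calls should print the same token list for this simple sentence.
print(treebank.TreebankWordTokenizer().tokenize(sentence))
print(word_tokenize(sentence))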
Let's move the imports out and rewrite demo_liu_hu_lexicon as a simple_sentiment function.
from nltk.corpus import opinion_lexicon
from nltk import word_tokenize

def simple_sentiment(text):
    pass
Next, we see
def demo_liu_hu_lexicon(sentence, plot=False):
    """
    Basic example of sentiment classification using Liu and Hu opinion lexicon.
    This function simply counts the number of positive, negative and neutral words
    in the sentence and classifies it depending on which polarity is more represented.
    Words that do not appear in the lexicon are considered as neutral.

    :param sentence: a sentence whose polarity has to be classified.
    :param plot: if True, plot a visual representation of the sentence polarity.
    """
    from nltk.corpus import opinion_lexicon
    from nltk.tokenize import treebank

    tokenizer = treebank.TreebankWordTokenizer()
    pos_words = 0
    neg_words = 0
    tokenized_sent = [word.lower() for word in tokenizer.tokenize(sentence)]

    x = list(range(len(tokenized_sent)))  # x axis for the plot
    y = []
The function
- first tokenizes and lower-cases the sentence
- then initializes the counters for positive and negative words.
x and y are initialized for some plotting later, so let's ignore that.
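To make that concrete, here's what the tokenized, lower-cased sentence and the x axis look like for a made-up example sentence:

from nltk.tokenize import treebank

tokenizer = treebank.TreebankWordTokenizer()
sentence = "This movie was not bad at all"

tokenized_sent = [word.lower() for word in tokenizer.tokenize(sentence)]
print(tokenized_sent)                     # ['this', 'movie', 'was', 'not', 'bad', 'at', 'all']
print(list(range(len(tokenized_sent))))   # [0, 1, 2, 3, 4, 5, 6] -- one x position per token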
If we go further down the function:
def demo_liu_hu_lexicon(sentence, plot=False):
    from nltk.corpus import opinion_lexicon
    from nltk.tokenize import treebank

    tokenizer = treebank.TreebankWordTokenizer()
    pos_words = 0
    neg_words = 0
    tokenized_sent = [word.lower() for word in tokenizer.tokenize(sentence)]

    x = list(range(len(tokenized_sent)))  # x axis for the plot
    y = []

    for word in tokenized_sent:
        if word in opinion_lexicon.positive():
            pos_words += 1
            y.append(1)  # positive
        elif word in opinion_lexicon.negative():
            neg_words += 1
            y.append(-1)  # negative
        else:
            y.append(0)  # neutral

    if pos_words > neg_words:
        print('Positive')
    elif pos_words < neg_words:
        print('Negative')
    elif pos_words == neg_words:
        print('Neutral')
The loop simply goes through each token and checks whether the word is in the positive / negative lexicon.
At the end, it compares the number of positive and negative words and prints the tag (note that the demo prints the tag rather than returning it).
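To see what those membership checks are working with, here's a minimal sketch (it assumes the opinion_lexicon corpus has been downloaded with nltk.download('opinion_lexicon'), and that 'good' / 'bad' appear in the respective word lists, which they should):

from nltk.corpus import opinion_lexicon

# opinion_lexicon.positive() and .negative() return plain lists of words.
print(len(opinion_lexicon.positive()), len(opinion_lexicon.negative()))
print('good' in opinion_lexicon.positive())   # True
print('bad' in opinion_lexicon.negative())    # True

# Membership tests on a list are O(n), so converting each list to a set once
# makes the repeated `word in ...` checks inside the loop much cheaper.
pos_set = set(opinion_lexicon.positive())
neg_set = set(opinion_lexicon.negative())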
Now let's see whether we can write a better simple_sentiment function, now that we know what demo_liu_hu_lexicon does: (1) tokenize and lower-case the sentence, (2) initialize the counters, (3) loop through the tokens and check the lexicons, (4) track y for the optional plot, and (5) compare the counts and print the tag.
Tokenization in step 1 can't be avoided, so we have:
from nltk.corpus import opinion_lexicon
from nltk import word_tokenize

def simple_sentiment(text):
    tokens = [word.lower() for word in word_tokenize(text)]
A lazy way to do steps 2-5 is to just copy+paste and change the print() -> return:
from nltk.corpus import opinion_lexicon
from nltk import word_tokenize

def simple_sentiment(text):
    pos_words = 0
    neg_words = 0
    tokens = [word.lower() for word in word_tokenize(text)]
    y = []  # kept from the copy+paste; only needed for plotting

    for word in tokens:
        if word in opinion_lexicon.positive():
            pos_words += 1
            y.append(1)  # positive
        elif word in opinion_lexicon.negative():
            neg_words += 1
            y.append(-1)  # negative
        else:
            y.append(0)  # neutral

    if pos_words > neg_words:
        return 'Positive'
    elif pos_words < neg_words:
        return 'Negative'
    elif pos_words == neg_words:
        return 'Neutral'
Now you have a function whose output you can use however you please.
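For example (the sample sentences are made up, and the exact labels depend on which of their words appear in the lexicon):

print(simple_sentiment("This movie was simply wonderful and amazing"))    # Positive
print(simple_sentiment("The plot was dull and the acting was terrible"))  # Negative
print(simple_sentiment("The film was released last year"))                # Neutral

# Because simple_sentiment returns a string instead of printing it,
# you can now map it over a whole list of texts.
reviews = ["great stuff", "what a horrible mess", "it exists"]
labels = [simple_sentiment(review) for review in reviews]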
BTW, the demo is really odd...
When we see a positive word we append 1 to y, and when we see a negative word we append -1.
And we say something is positive when pos_words > neg_words.
That means that the list-of-integers comparison follows Python's sequence comparison semantics, which might have no linguistic or mathematical logic =( (See: What happens when we compare a list of integers?)
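For reference, here's what Python's list comparison actually does (the lists below are made-up y-style polarity sequences):

# Python compares lists element by element, left to right (lexicographic order).
print([1, -1, 0] > [0, 1, 1])    # True  -> decided entirely by the first elements (1 > 0)
print([0, 0, 1] > [0, 1, -1])    # False -> 0 == 0, then 0 < 1
print([1] > [0, 1, 1, 1, 1])     # True  -> length is irrelevant once an element differs

# So "more 1s than -1s" (what the demo actually counts) is not the same thing
# as one y list comparing greater than another.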