
Big picture goal: I am making an LDA model of product reviews in Python using NLTK and Gensim. I want to run this on varying n-grams.

Problem: Everything is great with unigrams, but when I run with bigrams, I start to get topics with repeated information. For example, Topic 1 might contain: ['good product', 'good value'], and Topic 4 might contain: ['great product', 'great value']. To a human these obviously convey the same information, but 'good product' and 'great product' are of course distinct bigrams. How do I algorithmically determine that 'good product' and 'great product' are similar enough, so I can translate all occurrences of one of them to the other (maybe the one that appears more often in the corpus)?

What I've tried: I played around with WordNet's Synset tree, with little luck. It turns out that good is an 'adjective' but great is an 'adjective satellite', so path similarity returns None for the pair. My thought process was to do the following (sketched in code after the list):

  1. Part of speech tag the sentence
  2. Use these POS to find the correct Synset
  3. Compute similarity of the two Synsets
  4. If they are above some threshold, compute occurrences of both words
  5. Replace the least occurring word with the most occurring word
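
Concretely, here's a minimal sketch of that pipeline (Python 3 / NLTK 3, where Synset methods like definition() are called; the toy corpus, the POS mapping, the choice of first synset, and the 0.5 threshold are all placeholders I made up):

from collections import Counter
from nltk.corpus import wordnet as wn

# Toy corpus standing in for the tokenized reviews.
corpus = [s.split() for s in
          ["good product at a good price", "great product great value"]]
counts = Counter(w for sent in corpus for w in sent)

def wn_pos(treebank_tag):
    # Steps 1-2: map a Penn Treebank tag (from nltk.pos_tag) to a
    # WordNet POS constant so we can look up the right synsets.
    return {'J': wn.ADJ, 'N': wn.NOUN,
            'V': wn.VERB, 'R': wn.ADV}.get(treebank_tag[:1])

def merge_target(w1, w2, pos, threshold=0.5):
    # Steps 3-5: compare the first synsets; if similar enough, map the
    # rarer word onto the more frequent one.
    s1, s2 = wn.synsets(w1, pos), wn.synsets(w2, pos)
    if s1 and s2:
        sim = s1[0].path_similarity(s2[0])  # can be None: the problem below
        if sim is not None and sim >= threshold:
            return w1 if counts[w1] >= counts[w2] else w2
    return None

print(merge_target('good', 'great', wn.ADJ))  # prints None, which is the problem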

Ideally, though, I'd like an algorithm that can determine that good and great are similar in my corpus (perhaps in a co-occurring sense), so that it can be extended to words that aren't part of the general English language, but appear in my corpus, and so that it can be extended to n-grams (maybe Oracle and terrible are synonymous in my corpus, or feature engineering and feature creation are similar).
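
For what it's worth, here is a corpus-driven sketch along those lines with Gensim (which I'm already using): train word2vec on the reviews so that similarity comes from my corpus rather than from WordNet. The toy sentences and all parameters are placeholders and this uses the gensim 4 API; gensim.models.phrases.Phrases could first glue frequent bigrams like feature engineering into single tokens so the same trick extends to n-grams.

from gensim.models import Word2Vec

# Toy stand-in for the tokenized reviews; a real corpus is needed for
# the similarity numbers to mean anything.
sentences = [
    "good product good price".split(),
    "great product great price".split(),
    "good product works well".split(),
    "great product works well".split(),
] * 50

model = Word2Vec(sentences, vector_size=25, window=3, min_count=1, seed=0)
# On a real corpus, distributionally similar words score high here;
# on this toy corpus the exact number is noise.
print(model.wv.similarity('good', 'great'))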

Any suggestions on algorithms, or suggestions to get WordNet synset to behave?

user2979931
    These do not convey the same information to me. Great is stronger than good. Also, "good value" suggests a product has an attractive price for its quality level. "Good product" suggests that product is high quality. The latest Mac Pro looks like a great product, I wouldn't call it a great value. One approach would be to ask whether replacing great with good or value with product actually changes some outcome of interest. – ChrisP Jan 06 '14 at 18:20
  • @ChrisP - I see your point. But here's an example of two topics I get: Topic 1 - `['great service', 'good product', 'great price']`, Topic 2 - `['good service', 'quality product', 'good price']`. No human would label these as two distinct topics, and it isn't useful when there are other topics that could be considered. Both `good product` and `great product` describe the product in a positive manner, and you could see how they are more similar from, say `easy reference`. – user2979931 Jan 06 '14 at 18:33
  • @user2979931 did any of the answers answer your question? – alvas Jan 09 '14 at 06:20

2 Answers


If you're going to use WordNet, you have

Problem 1: Word Sense Disambiguation (WSD), i.e. how to automatically determine which synset to use?

>>> from nltk.corpus import wordnet as wn
>>> for i in wn.synsets('good','a'):
...     print i.name, i.definition
... 
good.a.01 having desirable or positive qualities especially those suitable for a thing specified
full.s.06 having the normally expected amount
good.a.03 morally admirable
estimable.s.02 deserving of esteem and respect
beneficial.s.01 promoting or enhancing well-being
good.s.06 agreeable or pleasing
good.s.07 of moral excellence
adept.s.01 having or showing knowledge and skill and aptitude
good.s.09 thorough
dear.s.02 with or in a close or intimate relationship
dependable.s.04 financially sound
good.s.12 most suitable or right for a particular purpose
good.s.13 resulting favorably
effective.s.04 exerting force or influence
good.s.15 capable of pleasing
good.s.16 appealing to the mind
good.s.17 in excellent physical condition
good.s.18 tending to promote physical well-being; beneficial to health
good.s.19 not forged
good.s.20 not left to spoil
good.s.21 generally admired

>>> for i in wn.synsets('great','a'):
...     print i.name, i.definition
... 
great.s.01 relatively large in size or number or extent; larger than others of its kind
great.s.02 of major significance or importance
great.s.03 remarkable or out of the ordinary in degree or magnitude or effect
bang-up.s.01 very good
capital.s.03 uppercase
big.s.13 in an advanced stage of pregnancy

Let's say you somehow get the correct sense, maybe by trying something like pywsd (https://github.com/alvations/pywsd), and you get the POS and synset right:

good.a.01 having desirable or positive qualities especially those suitable for a thing specified
great.s.01 relatively large in size or number or extent; larger than others of its kind
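
For completeness, a minimal WSD call with pywsd, following the usage shown in that repo's README (the example sentence is made up, and definition() is the NLTK 3 method form):

from pywsd.lesk import simple_lesk

# Disambiguate 'good' against its context sentence; returns an NLTK Synset.
sense = simple_lesk('This is a good product for the price', 'good', pos='a')
print(sense, sense.definition())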

Problem 2: How are you going to compare the 2 synsets?

Let's try the similarity functions, but you'll realize that they give you no score:

>>> from nltk.corpus import wordnet_ic
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')
>>> good = wn.synsets('good','a')[0]
>>> great = wn.synsets('great','a')[0]
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None
>>> print max(wn.wup_similarity(good,great), wn.wup_similarity(great, good))

>>> print max(wn.res_similarity(good,great,semcor_ic), wn.res_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1312, in res_similarity
    return synset1.res_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 738, in res_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.jcn_similarity(good,great,semcor_ic), wn.jcn_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1316, in jcn_similarity
    return synset1.jcn_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 759, in jcn_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
    return synset1.lin_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lch_similarity(good,great), wn.lch_similarity(great, good))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1304, in lch_similarity
    return synset1.lch_similarity(synset2, verbose, simulate_root)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 638, in lch_similarity
    (self, other))
nltk.corpus.reader.wordnet.WordNetError: Computing the lch similarity requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.

Let's try a different pair of synsets. Since good has both plain-adjective and satellite-adjective senses while great only has satellite senses, let's go with the lowest common denominator:

good.s.13 resulting favorably
great.s.01 relatively large in size or number or extent; larger than others of its kind

You realize that there is still no similarity information for comparing satellite adjectives:

>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
    return synset1.lin_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1645, in _lcs_ic
    ic1 = information_content(synset1, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1666, in information_content
    raise WordNetError(msg % synset.pos)
nltk.corpus.reader.wordnet.WordNetError: Information content file has no entries for part-of-speech: s
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None

Now it seems like WordNet is creating more problems than it solves here, so let's try another means: word clustering (a rough sketch of the idea follows). See http://en.wikipedia.org/wiki/Word-sense_induction
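
To make the clustering idea concrete, here's a rough sketch I just made up: represent every word by its same-sentence co-occurrence counts and cluster the rows, so that words with similar neighbors ('good', 'great') should land in the same cluster. The toy sentences and the number of clusters are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

sentences = [
    "good product good value".split(),
    "great product great value".split(),
    "terrible service awful support".split(),
]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Word-by-word co-occurrence matrix (same-sentence counts).
cooc = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for w1 in s:
        for w2 in s:
            if w1 != w2:
                cooc[idx[w1], idx[w2]] += 1

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(cooc)
for w in vocab:
    print(w, labels[idx[w]])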

This is also where I give up on answering the broad, open-ended question that the OP has posted, because there's a LOT done in clustering that is automagic to mere mortals like me =)

alvas

You said (emphasis added):

Ideally, though, I'd like an algorithm that can determine that good and great are similar in my corpus (perhaps in a co-occurring sense)

You can measure word similarity by measuring how often those words appear in the same sentence with other words (that is, co-occurrence). To capture more semantic relatedness, you can probably also capture collocations, that is, how often the words appear within the same window of neighboring words. A sketch of both feature types follows.
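
Here each word gets a profile of same-sentence co-occurrence counts plus counts restricted to a +/-2-word window, and cosine similarity compares the profiles. The toy sentences and the window size are made-up assumptions.

import numpy as np

sentences = [
    "good product at a good price".split(),
    "great product at a great price".split(),
    "terrible support from oracle".split(),
]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

def profile(word, window=2):
    # First half: same-sentence co-occurrence; second half: collocations
    # within +/-window positions of the target word.
    v = np.zeros(2 * len(vocab))
    for s in sentences:
        for pos, w in enumerate(s):
            if w != word:
                continue
            for j, other in enumerate(s):
                if j == pos:
                    continue
                v[idx[other]] += 1
                if abs(j - pos) <= window:
                    v[len(vocab) + idx[other]] += 1
    return v

def cosine(a, b):
    return a.dot(b) / ((np.linalg.norm(a) * np.linalg.norm(b)) or 1.0)

print(cosine(profile('good'), profile('great')))      # high: shared contexts
print(cosine(profile('good'), profile('terrible')))   # 0.0: no shared contexts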

This paper deals with Word Sense Disambiguation (WSD), and it uses collocations and surrounding words (co-occurrence) as part of its feature space. The results are quite good, so I guess you can use the same features for your problem.

In Python, you can use sklearn; in particular, you may want to look at its SVM module (which comes with sample code) to help you get started.

The general idea will be along this line (a sketch follows the list):

  1. Get a pair of bigrams that you want to check for similarity
  2. Using your corpus, generate the collocation and co-occurrence features for each bigram
  3. Train SVM to learn the features of the first bigram
  4. Run SVM on the occurrences of the other bigrams (you get some score here)
  5. Possibly use the scores to determine whether the two bigrams are similar to each other
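
A hedged sketch of those five steps (the toy corpus, the window-of-words features, and the contrast bigram 'easy reference' are all illustrative assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

def contexts(docs, bigram, window=3):
    # Step 2: collect the words around each occurrence of the bigram.
    w1, w2 = bigram.split()
    out = []
    for doc in docs:
        toks = doc.split()
        for i in range(len(toks) - 1):
            if toks[i] == w1 and toks[i + 1] == w2:
                out.append(' '.join(toks[max(0, i - window):i]
                                    + toks[i + 2:i + 2 + window]))
    return out

docs = [
    "this is a good product and works well",
    "a good product with a fair price",
    "this is a great product and works well",
    "a great product with a fair price",
    "the book is an easy reference for the api",
    "an easy reference with a detailed index",
]
pos = contexts(docs, 'good product')    # step 1: first bigram's contexts
neg = contexts(docs, 'easy reference')  # a contrast bigram for training
vec = CountVectorizer().fit(pos + neg)
clf = SVC(kernel='linear').fit(vec.transform(pos + neg),
                               [1] * len(pos) + [0] * len(neg))  # step 3

# Steps 4-5: a high average score means 'great product' shows up in
# contexts that look like those of 'good product'.
scores = clf.decision_function(vec.transform(contexts(docs, 'great product')))
print(scores.mean())

With a real review corpus you would have many contexts per bigram; in this toy example the great product contexts match the good product training contexts, so the scores should come out on the 'good product' side.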
justhalf