
Background:

I am trying to compare pairs of words to see which pair is "more likely to occur" in US English. My plan was to use the collocation facilities in NLTK to score word pairs, with the higher-scoring pair being the more likely.

Approach:

I coded the following in Python using NLTK (the tokenization step that produces tokens is omitted for brevity):

import nltk
from nltk.collocations import BigramCollocationFinder

bgm    = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)

Results:

I then examined the results using two word pairs, one of which should be highly likely to co-occur and one of which should not ("roasted cashews" and "gasoline cashews"). I was surprised to see these word pairings score identically:

[(('roasted', 'cashews'), 5.545177444479562)]
[(('gasoline', 'cashews'), 5.545177444479562)]

I would have expected 'roasted cashews' to score higher than 'gasoline cashews' in my test.

Questions:

  1. Am I misunderstanding the use of collocations?
  2. Is my code incorrect?
  3. Is my assumption that the scores should be different wrong, and if so why?

Thank you very much for any information or help!

  • One additional comment: Grouping all 4 words together, viz 'roasted cashews gasoline cashew', gave similar results in that all the bigram scores were identical. – ccgillett Dec 30 '11 at 20:12

1 Answer


The NLTK collocations documentation seems pretty good to me: http://www.nltk.org/howto/collocations.html

You need to give the scorer some actual sizable corpus to work with. Here is a working example using the Brown corpus built into NLTK. It takes about 30 seconds to run.

import collections

import nltk.collocations
import nltk.corpus

bgm    = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

print('doctor', prefix_keys['doctor'][:5])
print('baseball', prefix_keys['baseball'][:5])
print('happy', prefix_keys['happy'][:5])

The output seems reasonable: it works well for baseball, less so for doctor and happy.

doctor [('bills', 35.061321987405748), (',', 22.963930079491501),
  ('annoys', 19.009636692022365),
  ('had', 16.730384189212423), ('retorted', 15.190847940499127)]

baseball [('game', 32.110754519752291), ('cap', 27.81891372457088),
  ('park', 23.509042621473505), ('games', 23.105033513054011),
  ("player's", 16.227872863424668)]

happy [("''", 20.296341424483998), ('Spahn', 13.915820697905589),
  ('family', 13.734352182441569),
  (',', 13.55077617193821), ('bodybuilder', 13.513265447290536)]
  • Ok, this explains some of my misunderstanding. Is there a convenient way to search for a bigram and get a relative score? Still looking for a usage pattern that will let me check a given bigram for relevance. And thanks for your answer, very helpful! – ccgillett Dec 30 '11 at 21:04
  • You can either use the code as is with a large corpus and keep the scores in a big bigram-keyed dictionary, or maintain somewhat more raw unigram and bigram frequency counts (nltk calls these FreqDist) that you feed into the builtin bigram scorers when you want to compare particular bigrams (see the sketch after these comments). – Rob Neuhaus Dec 30 '11 at 21:13
  • Thanks! I got a very cool solution running using a custom corpus last night. It's doing a good job on some difficult subject matter. Thanks for unblocking me! – ccgillett Dec 31 '11 at 17:49
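For the second approach described in the comments (keeping raw FreqDist counts and feeding them into the builtin scorers), here is a minimal sketch. The score_pair helper and the variable names are mine, not part of NLTK; the counts come from the Brown corpus again:

import nltk
import nltk.collocations
import nltk.corpus
from nltk.probability import FreqDist

words = nltk.corpus.brown.words()

# Raw unigram and bigram counts, kept around for repeated queries.
word_fd   = FreqDist(words)
bigram_fd = FreqDist(nltk.bigrams(words))
n_xx      = bigram_fd.N()  # total number of bigrams in the corpus

bgm = nltk.collocations.BigramAssocMeasures()

def score_pair(w1, w2):
    # Likelihood-ratio score for one specific bigram; None if it never occurs.
    n_ii = bigram_fd[(w1, w2)]
    if n_ii == 0:
        return None
    return bgm.likelihood_ratio(n_ii, (word_fd[w1], word_fd[w2]), n_xx)

print(score_pair('baseball', 'game'))
print(score_pair('gasoline', 'cashews'))  # None: never occurs in Brown

If you already have a finder built, BigramCollocationFinder also has a score_ngram(score_fn, w1, w2) method that performs the same lookup against the finder's internal counts.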