Background:
I am trying to compare pairs of words to see which pair is "more likely to occur" in US English. My plan was to use the collocation facilities in NLTK to score each word pair, with the higher-scoring pair being the more likely one.
Approach:
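I coded the following in Python using NLTK (the steps that build tokens are removed for brevity):
import nltk
from nltk.collocations import BigramCollocationFinder
# tokens is the list of word tokens; the steps that build it are omitted here
bgm = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)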
Results:
I then examined the results using two word pairs: one that should be highly likely to co-occur ("roasted cashews") and one that should not ("gasoline cashews"). I was surprised to see these word pairs score identically:
[(('roasted', 'cashews'), 5.545177444479562)]
[(('gasoline', 'cashews'), 5.545177444479562)]
I would have expected 'roasted cashews' to score higher than 'gasoline cashews' in my test.
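As a sanity check on my own understanding, here is a minimal pure-Python sketch of a log-likelihood ratio computed from a 2x2 table of counts. It is not NLTK's implementation: the function name g2_score, its parameters, and all of the counts below are hypothetical and only for illustration. If a score of this kind depends only on counts, then any two pairs with identical counts would score identically, which may be what I am seeing.

```python
import math


def g2_score(pair_count, first_count, second_count, total_bigrams):
    """Log-likelihood ratio (G^2) for a word pair, from counts only.

    Hypothetical helper, not NLTK's code:
      pair_count    - times the pair (w1, w2) occurs
      first_count   - times w1 occurs as the first word of any bigram
      second_count  - times w2 occurs as the second word of any bigram
      total_bigrams - total number of bigrams counted
    """
    total = float(total_bigrams)
    # Observed cells of the 2x2 contingency table.
    first_only = first_count - pair_count        # w1 followed by another word
    second_only = second_count - pair_count      # another word followed by w2
    neither = total_bigrams - pair_count - first_only - second_only
    observed = [pair_count, first_only, second_only, neither]
    # Row and column totals that each cell belongs to.
    rows = [first_count, first_count,
            total_bigrams - first_count, total_bigrams - first_count]
    cols = [second_count, total_bigrams - second_count,
            second_count, total_bigrams - second_count]
    score = 0.0
    for obs, row, col in zip(observed, rows, cols):
        if obs > 0:  # an empty cell contributes nothing to the sum
            expected = row * col / total
            score += obs * math.log(obs / expected)
    return 2.0 * score


# Hypothetical counts: each pair seen once in a two-word "corpus".
roasted = g2_score(1, 1, 1, 2)
gasoline = g2_score(1, 1, 1, 2)
print(roasted == gasoline)

# Hypothetical counts from a larger corpus, where the pairs differ.
frequent_pair = g2_score(30, 40, 50, 10000)
rare_pair = g2_score(1, 40, 50, 10000)
print(frequent_pair > rare_pair)
```

In this sketch the words themselves never enter the calculation, only their counts, so the two pairs can only be told apart if the counts come from a body of text large enough for them to differ.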
Questions:
- Am I misunderstanding the use of collocations?
- Is my code incorrect?
- Is my assumption that the scores should be different wrong, and if so why?
Thank you very much for any information or help!