
I know NLTK can tell you the likelihood of a word within a given context: nltk language model (ngram) calculate the prob of a word from context

But can it tell you the count (or likelihood) of a given ngram within the Brown corpus? For instance, can it tell you the number of times the phrase "chocolate milkshake" occurs in the Brown corpus?

I know you can do this with Google Ngrams, but the data is a little unwieldy. I am wondering if there is a way to do it with plain NLTK.


2 Answers

from collections import Counter

from nltk.corpus import brown
from nltk.util import ngrams

n = 2
bigrams = ngrams(brown.words(), n)  # generator over all bigrams in the Brown corpus
bigrams_freq = Counter(bigrams)     # maps each bigram tuple to its count

print(bigrams_freq[('chocolate', 'milkshake')])
print(bigrams_freq.most_common()[2000])

[out]:

0
(('beginning', 'of'), 42)
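
For reference, NLTK's own FreqDist (a subclass of collections.Counter) builds the same table, so you can stay entirely within NLTK; a minimal sketch:

from nltk import FreqDist
from nltk.corpus import brown
from nltk.util import ngrams

# FreqDist behaves like Counter: looking up an unseen bigram returns 0
bigram_freq = FreqDist(ngrams(brown.words(), 2))
print(bigram_freq[('chocolate', 'milkshake')])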

Using nltk.bigrams(<tokenizedtext>), it's easy to count them: make an empty dictionary, iterate through the bigram list, and add or update the count for each bigram (the dictionary will have the form {<bigram>: <count>}). Once you have this dictionary, look up any bigram you are interested in with dict[<bigram>].

An example, building the list brown_bigrams with nltk.bigrams on the Brown corpus tokens:

import nltk
from nltk.corpus import brown

brown_bigrams = list(nltk.bigrams(brown.words()))

frequencies = {}
for ngram in brown_bigrams:
    if ngram in frequencies:
        frequencies[ngram] += 1
    else:
        frequencies[ngram] = 1

# frequency of ('chocolate', 'milkshake'); .get returns 0 for an unseen
# bigram instead of raising a KeyError
print(frequencies.get(('chocolate', 'milkshake'), 0))
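
As a side note, collections.defaultdict(int) collapses the if/else counting pattern above; a minimal sketch under the same assumptions about brown_bigrams:

from collections import defaultdict

frequencies = defaultdict(int)  # missing keys start at 0
for ngram in brown_bigrams:
    frequencies[ngram] += 1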
