
I know NLTK can tell you the likelihood of a word within a given context: nltk language model (ngram) calculate the prob of a word from context

But can it tell you the count (or likelihood) of a given ngram within the Brown corpus? For instance, can it tell you the number of times the phrase "chocolate milkshake" occurs in the Brown corpus?

I know you can do this with Google Ngrams, but the data is a little unwieldy. I am wondering if there is a way to do it with plain NLTK.


2 Answers

from collections import Counter

from nltk.corpus import brown
from nltk.util import ngrams

n = 2
bigrams = ngrams(brown.words(), n)  # generator over all bigrams in the Brown corpus
bigrams_freq = Counter(bigrams)     # maps each bigram tuple to its count

print(bigrams_freq[('chocolate', 'milkshake')])
print(bigrams_freq.most_common()[2000])

[out]:

0
(('beginning', 'of'), 42)
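
For reference, NLTK's own FreqDist (a subclass of collections.Counter) builds the same table, so you can stay entirely within NLTK; a minimal sketch:

from nltk import FreqDist
from nltk.corpus import brown
from nltk.util import ngrams

# FreqDist behaves like Counter: looking up an unseen bigram returns 0
bigram_freq = FreqDist(ngrams(brown.words(), 2))
print(bigram_freq[('chocolate', 'milkshake')])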

Using nltk.bigrams(<tokenizedtext>), it's easy to count them: make an empty dictionary, iterate through the bigram list, and add or update the count for each bigram (the dictionary will have the form {<bigram>: <count>}). Once you have this dictionary, look up any bigram you are interested in with dict[<bigram>].

An example, building the list brown_bigrams with nltk.bigrams on the Brown corpus tokens:

import nltk
from nltk.corpus import brown

brown_bigrams = list(nltk.bigrams(brown.words()))

frequencies = {}
for ngram in brown_bigrams:
    if ngram in frequencies:
        frequencies[ngram] += 1
    else:
        frequencies[ngram] = 1

# frequency of ('chocolate', 'milkshake'); .get returns 0 for an unseen
# bigram instead of raising a KeyError
print(frequencies.get(('chocolate', 'milkshake'), 0))
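
As a side note, collections.defaultdict(int) collapses the if/else counting pattern above; a minimal sketch under the same assumptions about brown_bigrams:

from collections import defaultdict

frequencies = defaultdict(int)  # missing keys start at 0
for ngram in brown_bigrams:
    frequencies[ngram] += 1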
