Collecting all n-grams (and their frequencies) from document

Question

I want to collect all n-grams from a text and count their frequencies. Both tasks can be solved in one or two Python files. This is what I have so far. It works on a hard-coded sentence, but I need it to read from a .txt file instead.

from nltk import ngrams

sentence = 'Hello, this is an example'

n = 3
threegrams = ngrams(sentence.split(), n)

for grams in threegrams:
    print(grams)

Does [this previous SO post](https://stackoverflow.com/questions/58327404/n-gram-frequency-python-ntlk) help? — Frodnar, Apr 05 '21 at 13:16

score 0 · Answer 1 · edited Apr 05 '21 at 16:07

I found a good answer here and can break it down for you. Your goal could be met in just one file.

First, import these nltk libraries:

import nltk
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
from nltk.tokenize import word_tokenize

Collocations are multi-word expressions that commonly co-occur, which is why the nltk.collocations module helps with finding their frequencies. word_tokenize is an alternative to sentence.split() that also separates punctuation into its own tokens, using tools readily available in the nltk package.
(If you get an output error about missing these packages, check this out)
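To make the difference between sentence.split() and a punctuation-aware tokenizer concrete, here is a minimal sketch using only the standard library. The helper simple_tokenize is a hypothetical stand-in based on a regular expression. It is not NLTK's algorithm and will differ from word_tokenize on contractions, abbreviations and similar cases.

```python
import re


def simple_tokenize(text):
    """Rough stand-in for a tokenizer (hypothetical helper, not NLTK's word_tokenize)."""
    # Match either a run of word characters or one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Hello, this is an example"

# Plain whitespace splitting keeps the comma attached to "Hello"
print(sentence.split())

# The regex-based version emits the comma as a separate token
print(simple_tokenize(sentence))
```

Whichever tokenizer you pick decides which n-grams you end up counting, so it is worth choosing before comparing frequencies between runs.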

Here is the sentence I used just to see what my script would do with trigrams:

sentence = "Hello, this is an example. This is an example of the trigram count. The trigram count is neat"

To instead read a txt file, replace that line with this:

with open("file.txt", "r") as f:
    myFile = f.read()

Next, we are going to tokenize and collocate each trigram:

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(sentence))
# For .txt files, replace 'sentence' with 'myFile'

Finally, we print the trigrams and their frequencies:

for i in finder.score_ngrams(trigram_measures.raw_freq):
    print(i)

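raw_freq is one of the scoring methods of the TrigramAssocMeasures class; its other methods score the trigrams by measures other than frequency. Note that raw_freq gives a relative frequency (each trigram's count divided by the total number of tokens, 21 for this sentence), not a raw count.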

Here was my output:

(('is', 'an', 'example'), 0.09523809523809523)
((',', 'this', 'is'), 0.047619047619047616)
(('.', 'The', 'trigram'), 0.047619047619047616)
(('.', 'This', 'is'), 0.047619047619047616)
(('Hello', ',', 'this'), 0.047619047619047616)
(('The', 'trigram', 'count'), 0.047619047619047616)
(('This', 'is', 'an'), 0.047619047619047616)
(('an', 'example', '.'), 0.047619047619047616)
(('an', 'example', 'of'), 0.047619047619047616)
(('count', '.', 'The'), 0.047619047619047616)
(('count', 'is', 'neat'), 0.047619047619047616)
(('example', '.', 'This'), 0.047619047619047616)
(('example', 'of', 'the'), 0.047619047619047616)
(('of', 'the', 'trigram'), 0.047619047619047616)
(('the', 'trigram', 'count'), 0.047619047619047616)
(('this', 'is', 'an'), 0.047619047619047616)
(('trigram', 'count', '.'), 0.047619047619047616)
(('trigram', 'count', 'is'), 0.047619047619047616)
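If you want absolute counts, or n-grams of any length rather than only trigrams, the following is a minimal sketch using just the standard library instead of nltk.collocations. The names ngram_counts and ngram_counts_from_file are hypothetical helpers, and the file version assumes plain whitespace tokenization and UTF-8 text. The token list is written out by hand to match the tokens visible in the output above.

```python
from collections import Counter
from pathlib import Path


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # Zipping n staggered slices of the list yields each n-gram as a tuple
    return Counter(zip(*(tokens[i:] for i in range(n))))


def ngram_counts_from_file(path, n, encoding="utf-8"):
    """Read a text file and count its n-grams, splitting on whitespace."""
    text = Path(path).read_text(encoding=encoding)
    return ngram_counts(text.split(), n)


# Tokens as they appear in the output above (punctuation is its own token)
tokens = ("Hello , this is an example . This is an example of the "
          "trigram count . The trigram count is neat").split()

counts = ngram_counts(tokens, 3)
for gram, count in counts.most_common(3):
    # Dividing by the number of tokens mirrors the raw_freq values above
    print(gram, count, count / len(tokens))
```

For a text file, ngram_counts_from_file("file.txt", 3) would return the same kind of Counter, and calling it in a loop over several values of n collects all n-gram sizes in one script.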
