I found a good answer here and can break it down for you. Your goal could be met in just one file.
First, import these nltk libraries:
import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
Collocations are multi-word expressions that commonly co-occur, which is why the nltk.collocations module helps with finding their frequencies.
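The word_tokenize function plays the same role as sentence.split(), using tools readily available in the nltk package. Unlike split(), it also separates punctuation into its own tokens, which is why commas and periods show up in the output further down.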
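(If you get an error about missing resources, the tokenizer data has probably not been downloaded yet. Running nltk.download('punkt') usually fixes it, though newer NLTK releases may ask for 'punkt_tab' instead.)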
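To see how tokenizing differs from a plain split, here is a minimal sketch that uses only the standard library. The regular expression is an assumption standing in for word_tokenize (it is not NLTK's actual tokenizer, which treats contractions and other cases differently), and simple_tokenize is a hypothetical helper name.

```python
import re

sentence = "Hello, this is an example."

def simple_tokenize(text):
    # Rough stand-in for word_tokenize: each run of word characters,
    # or each single punctuation mark, becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

split_tokens = sentence.split()
regex_tokens = simple_tokenize(sentence)

print(split_tokens)  # punctuation stays attached to the words
print(regex_tokens)  # punctuation becomes separate tokens
```

With split(), "Hello," and "example." keep their punctuation, so trigrams built from them would not line up with the same words elsewhere in the text.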
Here is the sentence I used just to see what my script would do with trigrams:
sentence = "Hello, this is an example. This is an example of the trigram count. The trigram count is neat"
To instead read a txt file, replace that line with this:
myFile = open("file.txt", 'r').read()
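A slightly more defensive way to read the file is sketched below, assuming the text is UTF-8 encoded. The path is hypothetical, and the snippet writes a small sample file first only so that it runs on its own.

```python
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("The trigram count is neat")

# The with-statement closes the file automatically once reading is done.
with open(path, "r", encoding="utf-8") as f:
    myFile = f.read()

print(myFile)
```

In your own script you would keep only the second with-block and point it at your txt file.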
Next, we are going to tokenize and collocate each trigram:
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(sentence))
#for txt files: replace the term 'sentence' with 'myFile'
Finally, we print the trigrams and their frequencies:
for i in finder.score_ngrams(trigram_measures.raw_freq):
    print(i)
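raw_freq is one of the scoring methods of the TrigramAssocMeasures class. It scores each trigram by its count divided by the total number of words. The class also offers measures other than frequency, such as pmi and likelihood_ratio.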
Here was my output:
(('is', 'an', 'example'), 0.09523809523809523)
((',', 'this', 'is'), 0.047619047619047616)
(('.', 'The', 'trigram'), 0.047619047619047616)
(('.', 'This', 'is'), 0.047619047619047616)
(('Hello', ',', 'this'), 0.047619047619047616)
(('The', 'trigram', 'count'), 0.047619047619047616)
(('This', 'is', 'an'), 0.047619047619047616)
(('an', 'example', '.'), 0.047619047619047616)
(('an', 'example', 'of'), 0.047619047619047616)
(('count', '.', 'The'), 0.047619047619047616)
(('count', 'is', 'neat'), 0.047619047619047616)
(('example', '.', 'This'), 0.047619047619047616)
(('example', 'of', 'the'), 0.047619047619047616)
(('of', 'the', 'trigram'), 0.047619047619047616)
(('the', 'trigram', 'count'), 0.047619047619047616)
(('this', 'is', 'an'), 0.047619047619047616)
(('trigram', 'count', '.'), 0.047619047619047616)
(('trigram', 'count', 'is'), 0.047619047619047616)
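The scores above can be reproduced by hand, which shows what raw_freq is measuring. This is a minimal sketch in plain Python, not NLTK's implementation. The regular expression is an assumed stand-in for word_tokenize that happens to split this particular sentence into the same 21 tokens.

```python
import re
from collections import Counter

sentence = ("Hello, this is an example. This is an example of the "
            "trigram count. The trigram count is neat")

# Assumed stand-in for word_tokenize: words and single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Slide a window of three tokens across the list.
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
counts = Counter(trigrams)

# Raw frequency here means: trigram count / total number of tokens.
total = len(tokens)
freqs = {tri: n / total for tri, n in counts.items()}

# Highest frequency first, ties broken alphabetically by trigram.
for tri, freq in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
    print((tri, freq))
```

With 21 tokens, a trigram seen once scores 1/21 (about 0.0476) and ('is', 'an', 'example'), seen twice, scores 2/21 (about 0.0952).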