I have several texts that I found collocations for, and now I'd like to create a table that shows how many times each collocation appears in each text of the corpus.

When I generate a table or a plot from the ConditionalFreqDist, it shows only 1 match for each collocation in each text.

I'm new to Python and am apparently doing something wrong. Please help.

Here is how I get collocations:

>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> eng_corpus_root = r'D:\Corpus\EN'  # raw string so the backslashes are not treated as escapes
>>> eng_corpus = PlaintextCorpusReader(eng_corpus_root, '.*')
>>> # Import4Corpuses3 is my own script that loads corpora for 4 languages from a local folder
>>> from Import4Corpuses3 import *
>>> # tengc_low is the English corpus (60 texts) as a Text object, with all letters lowercased
>>> tengc_low.collocation_list()
['hong kong', 'united states', 'getty images', 'european union', 'prime minister', 'northern ireland', 'boris johnson', 'cape dorset', 'extinction rebellion', 'extradition bill', 'cease fire', 'islamic state', 'recep tayyip', 'turkish backed', 'vice president', 'mike pence', 'tayyip erdogan', 'twitter com', 'pic twitter', 'anthony kwan']

Here is how I try to build a ConditionalFreqDist over text names and collocations:

>>> cfd = nltk.ConditionalFreqDist(
    (textname, collocation)
    for textname in eng_corpus.fileids()
    for collocation in Text(eng_corpus.words()).collocation_list(num=100))

As mentioned, this gives a count of 1 for every collocation in every text.

How can I get the correct distribution?

I would be grateful for any advice.
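
A likely cause, and one possible direction: in the generator above, `eng_corpus.words()` is called without a file id, so the same corpus-wide collocation list is produced for every text, and each (textname, collocation) pair is therefore yielded exactly once. The sketch below instead yields one pair per occurrence of a collocation in a given text. It is only an illustration under assumptions: `collocation_pairs` is a hypothetical helper, the toy `texts` dictionary stands in for something like `{f: eng_corpus.words(f) for f in eng_corpus.fileids()}`, the collocations are assumed to be two-word strings as in the output shown earlier, and the tokens are assumed to be preprocessed the same way as the ones the collocations were found in.

```python
from collections import Counter


def collocation_pairs(texts, collocations):
    """Yield (textname, collocation) once per occurrence in that text.

    Hypothetical helper: `texts` maps a text name to its list of tokens,
    `collocations` is a list of two-word strings such as 'hong kong'.
    """
    wanted = {tuple(c.split()) for c in collocations}
    for textname, words in texts.items():
        tokens = [w.lower() for w in words]
        # Walk over adjacent token pairs and keep only the known collocations.
        for bigram in zip(tokens, tokens[1:]):
            if bigram in wanted:
                yield textname, " ".join(bigram)


# Toy data (assumption) standing in for the per-file word lists of the corpus.
texts = {
    "a.txt": "Hong Kong protests continue in Hong Kong".split(),
    "b.txt": "The prime minister visited Hong Kong".split(),
}
collocations = ["hong kong", "prime minister"]

# Count occurrences per text with plain Counters.
counts = {}
for textname, collocation in collocation_pairs(texts, collocations):
    counts.setdefault(textname, Counter())[collocation] += 1

for textname, counter in counts.items():
    print(textname, dict(counter))
```

Because the helper yields (condition, sample) pairs, the same generator could presumably be passed straight to `nltk.ConditionalFreqDist(...)` in place of the `Counter` loop, and the result tabulated or plotted as before.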

Gavrk