NLTK frequency distribution for group of words

Question

Could you please help me calculate the frequency distribution of a "group of words"?

In other words, I have a text file. Here is a snapshot:

[image: snapshot of the text file]

Here is my code to find the 50 most common words in the text file:

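import nltk
from nltk import FreqDist

# Read the whole file, then split on any whitespace into single words.
# (The 'rU' mode was removed in Python 3.11; the default mode is enough.)
with open('myfile.txt') as f:
    text = f.read()
text1 = text.split()
keywords = nltk.Text(text1)
fdist1 = FreqDist(keywords)
print(fdist1.most_common(50))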

As you can see in the results, each word is counted individually. Here is a screenshot:

[image: screenshot of the results]

It works well, but I am trying to find the frequency distribution of each line in the text file. For instance, the first line contains the term 'conceptual change'. The program counts 'conceptual' and 'change' as separate keywords. However, I need the frequency of the term 'conceptual change' as a whole.

Welcome to Stack Overflow. You can improve your question a little by providing some information about what you tried and where you got stuck. — Willem, Jun 10 '17 at 11:24
Also: What does the input file actually look like? Show a few lines. — alexis, Jun 10 '17 at 12:20

score 1 · Answer 1 · answered Jun 13 '17 at 18:55

You're splitting the text on any whitespace. See the docs: this is the default behavior of split() when you do not pass a separator.

If you were to print out the value of text1 in your example program, you would see this. It's simply a list of words -- not lines -- so the damage has already been done by the time it's passed to FreqDist.
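The difference is easy to see on a small string. This is a minimal sketch, assuming a two-line sample made up for illustration, since the real contents of the file are not shown in the question:

```python
# Hypothetical sample standing in for the contents of myfile.txt.
sample = "conceptual change\nnaive physics"

# With no separator, split() breaks on every run of whitespace,
# so spaces and newlines both act as delimiters.
words = sample.split()

# With "\n" as the separator, only line breaks act as delimiters.
lines = sample.split("\n")

print(words)  # ['conceptual', 'change', 'naive', 'physics']
print(lines)  # ['conceptual change', 'naive physics']
```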

To fix it, replace text.split() with text.split("\n"):

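import nltk
from nltk import FreqDist

# The 'rU' mode was removed in Python 3.11; the default mode is enough.
with open('myfile.txt') as f:
    text = f.read()
# Split on newlines only, so each line stays one multi-word term.
text1 = text.split("\n")
keywords = nltk.Text(text1)
print(type(keywords))
fdist1 = FreqDist(keywords)
print(fdist1.most_common(50))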

This gives an output like:

[('conceptual change', 1), ('coherence', 1), ('cost-benefit tradeoffs', 1), ('interactive behavior', 1), ('naive physics', 1), ('rationality', 1), ('suboptimal performance', 1)]
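One caveat with text.split("\n"): a file that ends with a newline, or that contains blank lines, yields empty strings, and those get counted as a term too. Here is a rough sketch of one way around that. It uses collections.Counter from the standard library in place of FreqDist (FreqDist is a subclass of Counter, so most_common behaves the same way). The helper name count_terms and the sample text are hypothetical, chosen only for illustration:

```python
from collections import Counter


def count_terms(text):
    """Count how often each non-empty line occurs in text."""
    # splitlines() handles "\n" and "\r\n" and leaves no trailing empty entry.
    terms = (line.strip() for line in text.splitlines())
    # Drop lines that are empty after stripping surrounding whitespace.
    return Counter(term for term in terms if term)


# Hypothetical sample: one repeated term, one blank line, trailing newline.
sample = "conceptual change\ncoherence\n\nconceptual change\n"
counts = count_terms(sample)
print(counts.most_common(2))  # [('conceptual change', 2), ('coherence', 1)]
```

The same filtering could be applied to the list before passing it to FreqDist if the NLTK object is needed for later steps.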

Hey, I have a somewhat similar question and I was wondering if you could help me out. I am not splitting the words as the OP does; I am using an NLTK Text as input and I get the words 'heart' and 'rate' separately. Is there a way around this? Thanks in advance! https://stackoverflow.com/questions/45531514/nltk-freqdist-counting-two-words-as-one — tech4242, Aug 13 '17 at 18:33
