text1 in the nltk book is a collection of tokens (words, punctuation), unlike in your code example where text1 is a string (a sequence of Unicode codepoints):
>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
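For contrast, applying FreqDist to a plain string counts individual characters rather than words, because a string is a sequence of codepoints (a minimal REPL sketch; the ordering of ties in the output may vary):

>>> FreqDist('hello world')
FreqDist({'l': 3, 'o': 2, 'h': 1, 'e': 1, ' ': 1, 'w': 1, 'r': 1, 'd': 1})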
If your input is indeed space-separated words, then to find the frequencies use @Boa's answer:

from collections import Counter

freq = Counter(text_with_space_separated_words.split())
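For example, with sample words matching the output shown further below (a minimal REPL sketch):

>>> from collections import Counter
>>> Counter('hello hi hello heloo he'.split())
Counter({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})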
Note: FreqDist is a Counter, but it also defines additional methods such as .plot().
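That means all the usual Counter methods are available on a FreqDist too (a small sketch; .plot() draws a frequency plot and requires matplotlib to be installed):

>>> freq = FreqDist('hello hi hello'.split())
>>> freq.most_common(1)  # inherited from Counter
[('hello', 2)]
>>> freq.plot(10)        # FreqDist extra: plot the 10 most common samples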
If you want to use nltk tokenizers instead:
#!/usr/bin/env python3
from itertools import chain

from nltk import FreqDist, sent_tokenize, word_tokenize  # $ pip install nltk

with open('your_text.txt') as file:
    text = file.read()

# split into sentences, then split each sentence into words
words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
# normalize case before counting, so 'Hello' and 'hello' count as one word
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})
sent_tokenize() splits the text into sentences, and word_tokenize() then splits each sentence into words. There are many ways to tokenize text in nltk.
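For instance, wordpunct_tokenize() and RegexpTokenizer are two alternatives (a sketch; the regexp pattern here is just an illustration that keeps runs of word characters):

>>> from nltk.tokenize import wordpunct_tokenize, RegexpTokenizer
>>> wordpunct_tokenize("Don't stop")
['Don', "'", 't', 'stop']
>>> RegexpTokenizer(r'\w+').tokenize("Don't stop")
['Don', 't', 'stop']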