
I want to find the frequency of all words in my text file so that I can find the most frequently occurring ones. Can someone please help me with the command to use for that?

import nltk
text1 = "hello he heloo hello hi "  # example text
fdist1 = nltk.FreqDist(text1)

I have used the above code, but the problem is that it is not giving word frequency; rather, it is displaying the frequency of every character. I also want to know how to input the text from a text file.

Dan
frooty

5 Answers


I ran your example and saw the same thing you are seeing. For it to work properly, you have to split the string into words first. If you do not, FreqDist counts each character, which is what you were seeing. Splitting first returns the proper count of each word, not each character.

import nltk

text1 = 'hello he heloo hello hi '
text1 = text1.split()  # split on whitespace; split(' ') would also yield an empty token from the trailing space
fdist1 = nltk.FreqDist(text1)
print(fdist1.most_common(50))

If you want to read from a file and get the word count, you can do it like so:

input.txt

hello he heloo hello hi
my username is heinst
your username is frooty

python code

import nltk

with open("input.txt", "r") as myfile:
    data = myfile.read()

data = data.split()  # split on any whitespace, newlines included
fdist1 = nltk.FreqDist(data)
print(fdist1.most_common(50))
heinst

For what it's worth, NLTK seems like overkill for this task. The following will give you word frequencies, in order from highest to lowest.

from collections import Counter
input_string = [...] # get the input from a file
word_freqs = Counter(input_string.split())
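To fill in the file-reading step, here is a minimal sketch; it writes the question's sample text to input.txt first so the example is self-contained:

```python
from collections import Counter

# create a sample file (stand-in for your existing text file)
with open("input.txt", "w") as f:
    f.write("hello he heloo hello hi\n")

# read it back; split() with no argument handles spaces and newlines alike
with open("input.txt") as f:
    word_freqs = Counter(f.read().split())

# most_common() lists the highest-frequency words first
print(word_freqs.most_common(3))  # [('hello', 2), ('he', 1), ('heloo', 1)]
```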
Boa

text1 in the nltk book is a collection of tokens (words, punctuation) unlike in your code example where text1 is a string (collection of Unicode codepoints):

>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
          'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})

If your input is indeed space-separated words then to find the frequency, use @Boa's answer:

freq = Counter(text_with_space_separated_words.split())

Note: FreqDist is a Counter but it also defines additional methods such as .plot().
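Since FreqDist inherits from Counter, the Counter methods shown here work on a FreqDist too; this quick sketch uses Counter itself so it runs without nltk installed:

```python
from collections import Counter

freq = Counter("hello he heloo hello hi".split())

# most_common() is a Counter method, so FreqDist gets it for free
print(freq.most_common(2))   # [('hello', 2), ('he', 1)]

# Counter also supports arithmetic, e.g. merging counts from two texts
more = Counter("hi hi hello".split())
print((freq + more)["hi"])   # 3
```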

If you want to use nltk tokenizers instead:

#!/usr/bin/env python3
from itertools import chain
from nltk import FreqDist, sent_tokenize, word_tokenize # $ pip install nltk

with open('your_text.txt') as file:
    text = file.read()
words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})

sent_tokenize() tokenizes the text into sentences. Then word_tokenize tokenizes each sentence into words. There are many ways to tokenize text in nltk.
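If you just want to see why plain split() is not enough once punctuation appears, this stdlib-only sketch approximates the effect (re.findall here is only a crude stand-in for word_tokenize):

```python
import re

text = "Hello, world. Hello again!"

# str.split leaves punctuation glued to the words
print(text.split())  # ['Hello,', 'world.', 'Hello', 'again!']

# a crude regex tokenizer pulls out just the lowercased words
words = re.findall(r"[a-z]+", text.casefold())
print(words)         # ['hello', 'world', 'hello', 'again']
```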

jfs

In order to get the words and their frequencies as a dictionary, the following code may help:

import nltk
from nltk.tokenize import word_tokenize

# inputSentence is the string you want to analyse
fre = nltk.FreqDist(word_tokenize(inputSentence))
word_freqs = {f: fre[f] for f in fre}

print(word_freqs)
Dibin Joseph

I think the code below will get you the frequency of each word in the file as a dictionary:

myfile = open('greet.txt')
temp = myfile.read()
x = temp.split("\n")
y = list()
for item in x:
    z = item.split(" ")
    y.append(z)
count = dict()
for name in y:
    for items in name:
        if items not in count:
            count[items] = 1
        else:
            count[items] = count[items] + 1
print(count)
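For comparison, the if/else branch above can be collapsed with dict.get; this sketch uses the question's sample string instead of greet.txt so it is self-contained:

```python
# same counting logic as above, without the explicit if/else branch
words = "hello he heloo hello hi".split()
count = {}
for w in words:
    count[w] = count.get(w, 0) + 1  # default to 0 for unseen words
print(count)  # {'hello': 2, 'he': 1, 'heloo': 1, 'hi': 1}
```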