34

The Python package nltk has the FreqDist function, which gives you the frequency of words within a text. I am trying to pass my text as an argument, but the result is of the form:

[' ', 'e', 'a', 'o', 'n', 'i', 't', 'r', 's', 'l', 'd', 'h', 'c', 'y', 'b', 'u', 'g', '\n', 'm', 'p', 'w', 'f', ',', 'v', '.', "'", 'k', 'B', '"', 'M', 'H', '9', 'C', '-', 'N', 'S', '1', 'A', 'G', 'P', 'T', 'W', '[', ']', '(', ')', '0', '7', 'E', 'J', 'O', 'R', 'j', 'x']

whereas in the example on the nltk website, the result was whole words not characters. Here is how I am currently using the function:

file_y = open(fileurl)
p = file_y.read()
fdist = FreqDist(p)
vocab = fdist.keys()
vocab[:100]

What am I doing wrong?

Michael Delgado
  • 13,789
  • 3
  • 29
  • 54
afg102
  • 361
  • 2
  • 4
  • 4

6 Answers

50

FreqDist expects an iterable of tokens. A string is iterable, and iterating it yields one character at a time.

Pass your text to a tokenizer first, and pass the tokens to FreqDist.
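To see the difference with nothing but the standard library (FreqDist subclasses collections.Counter, so Counter is a fair stand-in here, and str.split is a crude stand-in for an NLTK tokenizer):

```python
from collections import Counter  # nltk's FreqDist subclasses Counter

text = "the cat sat on the mat"

# Passing the raw string counts characters, because iterating a
# string yields one character at a time.
char_counts = Counter(text)

# Tokenize first, then count: now the keys are words.
word_counts = Counter(text.split())

print(list(char_counts)[:3])        # single characters: 't', 'h', 'e'
print(word_counts.most_common(1))   # [('the', 2)]
```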

Alex Brasetvik
  • 11,218
  • 2
  • 35
  • 36
  • Indeed it does, but its docstring doesn't say that *anywhere*, nor do its error messages, and it would be trivial for its `__init__()` to either raise an error message saying so on non-iterator input, or accept a sequence and convert it to an iterator. – smci Jul 28 '13 at 05:23
  • 1
    @afg102 If it has worked, please accept the answer so that others also know what is the solution to the problem. – rishi Aug 08 '13 at 09:44
33

FreqDist runs on a list of tokens. You're sending it a sequence of characters (a string), when you should have tokenized the input first:

words = nltk.tokenize.word_tokenize(p)
fdist = FreqDist(words)
Eran Kampf
  • 8,928
  • 8
  • 49
  • 47
23

NLTK's FreqDist accepts any iterable. As a string is iterated character by character, it is pulling things apart in the way that you're experiencing.

In order to count words, you need to feed FreqDist words. How do you do that? Well, you might think (as others have suggested in answers to this question) to feed the whole file to nltk.tokenize.word_tokenize.

>>> # first, let's import the dependencies
>>> import nltk
>>> from nltk.probability import FreqDist

>>> # wrong :(
>>> words = nltk.tokenize.word_tokenize(p)
>>> fdist = FreqDist(words)

word_tokenize builds word models from sentences. It needs to be fed each sentence one at a time. It will do a relatively poor job when given whole paragraphs or even documents.

So, what to do? Easy, add in a sentence tokenizer!

>>> fdist = FreqDist()
>>> for sentence in nltk.tokenize.sent_tokenize(p):
...     for word in nltk.tokenize.word_tokenize(sentence):
...         fdist[word] += 1

One thing to bear in mind is that there are many ways to tokenize text. The functions nltk.tokenize.sent_tokenize and nltk.tokenize.word_tokenize simply pick a reasonable default for relatively clean English text. There are several other options to choose from, which you can read about in the API documentation.
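For instance, a regex-based tokenizer in the spirit of nltk.tokenize.RegexpTokenizer can be sketched with the standard library alone (the pattern below is an illustrative choice, not NLTK's default):

```python
import re

text = "Don't panic: tokenization choices matter!"

# Keep runs of word characters and apostrophes, drop other punctuation.
tokens = re.findall(r"[\w']+", text)
print(tokens)  # ["Don't", 'panic', 'tokenization', 'choices', 'matter']
```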

Tim McNamara
  • 18,019
  • 4
  • 52
  • 83
  • The OP doesn't want letter frequencies! (no one else does either...) They want word frequencies. – smci Jul 28 '13 at 04:57
  • 1
    Actually, letter frequencies are very common features for automatic language detection. – Tim McNamara Jul 29 '13 at 09:56
  • True, for that niche. Also decryption. In general not much though. – smci Aug 01 '13 at 03:44
  • 1
    Really useful answer, however it seems to be a bit outdated: `AttributeError: 'FreqDist' object has no attribute 'inc'`. Not complaining, just throwing it out there for others to be aware of this. I'll try to figure out an answer to this ;) Thanks – Aleksander Lidtke Jun 19 '16 at 13:08
  • Yes, things have changed quite a bit to some of the internal NLTK APIs in the last 5 years! Will update the code :) – Tim McNamara Jun 23 '16 at 21:54
  • Hei Tim, you're missing a colon at the end of the second `for` loop statement. – martin-martin Sep 24 '17 at 07:33
9

You simply have to use it like this:

import nltk
from nltk.probability import FreqDist

sentence='''This is my sentence'''
tokens = nltk.tokenize.word_tokenize(sentence)
fdist=FreqDist(tokens)

The variable fdist is of type `nltk.probability.FreqDist` and contains the frequency distribution of words.
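Since FreqDist behaves like a dictionary (it subclasses collections.Counter), individual counts can be read by indexing; a small sketch, assuming nltk is installed:

```python
from nltk.probability import FreqDist

# FreqDist over an already-tokenized list; indexing returns counts.
fdist = FreqDist(["this", "is", "my", "sentence", "my"])
print(fdist["my"])           # 2
print(fdist.most_common(1))  # [('my', 2)]
```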

Aakash Anuj
  • 3,773
  • 7
  • 35
  • 47
1
Your_string = "here is my string"
tokens = Your_string.split()

Do it this way, and then use the NLTK functions on the resulting tokens.

It will give you your tokens as words, not characters.
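One caveat worth noting: str.split only splits on whitespace, so punctuation stays attached to the neighbouring word, unlike NLTK's tokenizers:

```python
from collections import Counter  # FreqDist subclasses Counter

s = "Hello, world. Hello again"
tokens = s.split()
print(tokens)                    # ['Hello,', 'world.', 'Hello', 'again']
print(Counter(tokens)["Hello"])  # 1 -- 'Hello,' (with comma) is a separate token
```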

Musadiq
  • 29
  • 1
  • 5
0
# `text` is assumed to be a list of tokens already; if it is a raw
# string, list(text) splits it into individual characters, not words.
text_dist = nltk.FreqDist(word for word in list(text) if word.isalpha())
top1_text1 = text_dist.max()  # the single most frequent token
maxfreq = top1_text1
  • 7
    While this code may answer the question, it would be better to explain how it solves the problem without introducing others and why to use it. Code-only answers are not useful in the long run. – jnovack Oct 03 '20 at 16:29