text1 in the nltk book is a collection of tokens (words, punctuation), unlike in your code example where text1 is a string (a sequence of Unicode codepoints):
>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
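For contrast, applying FreqDist to a plain string counts individual characters rather than words, because a string is a sequence of codepoints (a minimal REPL sketch; the ordering of ties in the output may vary):

>>> FreqDist('hello world')
FreqDist({'l': 3, 'o': 2, 'h': 1, 'e': 1, ' ': 1, 'w': 1, 'r': 1, 'd': 1})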
If your input is indeed space-separated words, then to find the frequencies use @Boa's answer:

from collections import Counter

freq = Counter(text_with_space_separated_words.split())
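For example, with sample words matching the output shown further below (a minimal REPL sketch):

>>> from collections import Counter
>>> Counter('hello hi hello heloo he'.split())
Counter({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})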
Note: FreqDist is a Counter, but it also defines additional methods such as .plot().
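That means all the usual Counter methods are available on a FreqDist too (a small sketch; .plot() draws a frequency plot and requires matplotlib to be installed):

>>> freq = FreqDist('hello hi hello'.split())
>>> freq.most_common(1)  # inherited from Counter
[('hello', 2)]
>>> freq.plot(10)        # FreqDist extra: plot the 10 most common samples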
If you want to use nltk tokenizers instead:
#!/usr/bin/env python3
from itertools import chain

from nltk import FreqDist, sent_tokenize, word_tokenize  # $ pip install nltk

with open('your_text.txt') as file:
    text = file.read()

# split into sentences, then split each sentence into words
words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
# normalize case before counting, so 'Hello' and 'hello' count as one word
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})
sent_tokenize() splits the text into sentences, and word_tokenize() then splits each sentence into words. There are many ways to tokenize text in nltk.
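For instance, wordpunct_tokenize() and RegexpTokenizer are two alternatives (a sketch; the regexp pattern here is just an illustration that keeps runs of word characters):

>>> from nltk.tokenize import wordpunct_tokenize, RegexpTokenizer
>>> wordpunct_tokenize("Don't stop")
['Don', "'", 't', 'stop']
>>> RegexpTokenizer(r'\w+').tokenize("Don't stop")
['Don', 't', 'stop']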