I need to process the text file chesterton-brown.txt and do the following:
1. Determine the number of words in the text.
2. Identify the 10 most frequently used words and build a bar chart from this data.
3. Remove stop words and punctuation from the text, then find the 10 most frequently used words again and build another bar chart from them.
I would like to see the text I am processing. I have seen the following function used for this: brown = gutenberg.words('chesterton-brown.txt'). But it only seems to return 6 words. Are there really only 6 words in this file?
Also, to identify the 10 most used words I need to do tokenization, as far as I understand, then remove the stop words and count again. But I do not understand how to assign the contents of a text file to a variable so I can perform these operations. In general, the topic seems very complicated to me, and searching for information has not given me more understanding. It would be great if someone could explain how it works in general and which functions are best to use.
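For reference, these are the accessor functions I have found so far, with my (possibly wrong) understanding of each one in the comments:

```python
import nltk
nltk.download('gutenberg', quiet=True)
from nltk.corpus import gutenberg

raw_text = gutenberg.raw('chesterton-brown.txt')   # the whole file as one string
tokens = gutenberg.words('chesterton-brown.txt')   # already tokenised: a list-like of words and punctuation
sents = gutenberg.sents('chesterton-brown.txt')    # tokenised and additionally split into sentences

print(raw_text[:60])   # first characters of the raw text
print(tokens[:10])     # first ten tokens
print(len(sents))      # number of sentences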
This is how I downloaded and imported the necessary text file; I just do not understand how to work with it.
import nltk
nltk.download('gutenberg')  # fetch the Gutenberg corpus data
from nltk.corpus import gutenberg

brown1 = gutenberg.fileids()  # list the file ids available in the corpus
print(brown1)