I need to process the text file chesterton-brown.txt and do the following:
1. Determine the number of words in the text.
2. Identify the 10 most frequently used words and build a bar chart from this data.
3. Remove stop words and punctuation from the text, then find the 10 most frequently used words again and build another bar chart from them.
I would like to see the text I am processing. I have seen the following function used for this: brown = gutenberg.words('chesterton-brown.txt'). But it only seems to return 6 words. Are there really only 6 words in this file?
Also, to identify the 10 most used words I need to do tokenization, as far as I understand, then remove the stop words and count again. But I do not understand how to assign the contents of a text file to a variable so I can perform these operations. In general, the topic seems very complicated to me, and searching for information has not given me more understanding. It would be great if someone could explain how it works in general and which functions are best to use.
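For reference, these are the accessor functions I have found so far, with my (possibly wrong) understanding of each one in the comments:

```python
import nltk
nltk.download('gutenberg', quiet=True)
from nltk.corpus import gutenberg

raw_text = gutenberg.raw('chesterton-brown.txt')   # the whole file as one string
tokens = gutenberg.words('chesterton-brown.txt')   # already tokenised: a list-like of words and punctuation
sents = gutenberg.sents('chesterton-brown.txt')    # tokenised and additionally split into sentences

print(raw_text[:60])   # first characters of the raw text
print(tokens[:10])     # first ten tokens
print(len(sents))      # number of sentences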
This is how I downloaded and imported the necessary text file; I just do not understand how to work with it.
import nltk
nltk.download('gutenberg')  # fetch the Gutenberg corpus data
from nltk.corpus import gutenberg

brown1 = gutenberg.fileids()  # list the file ids available in the corpus
print(brown1)