I am new to Python. I am doing natural language processing on a novel, and I chose to load the book from the NLTK Gutenberg corpus (the files listed by nltk.corpus.gutenberg.fileids()). I am only using 'Sense and Sensibility'. I want to analyze each chapter separately. How can I split the whole book into chapters? I notice that books loaded this way have a particular format; it is not like a plain .txt file.

import nltk
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

When I print the book out, it shows: ['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', ...]

sense = nltk.Text(nltk.corpus.gutenberg.words('austen-sense.txt'))
print(sense)

Then there is another format: <Text: Sense and Sensibility by Jane Austen 1811>. I don't know what it means.

If I use another .txt book source, I also don't know how to split the chapters. I've uploaded the book into the folder, then:

text = 'senseText.txt'
Freda Yu

1 Answer

It's not like txt format.

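If you want the whole text as a single string, call raw() directly. Don't wrap it in nltk.Text, which expects a list of tokens and would treat the string as a sequence of single characters:

raw = nltk.corpus.gutenberg.raw('austen-sense.txt')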

If you want individual sentences (each one a list of word tokens), you can use sents(), again without wrapping it in nltk.Text:

sentences = nltk.corpus.gutenberg.sents('austen-sense.txt')

NLTK's Gutenberg corpus doesn't break up the text by chapters for you. (Many of the original sources didn't have chapters to begin with.) If your specific text happens to include chapter headings in the raw string, you could search for those and split on them, but the pattern would be text-specific.
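
As a rough sketch of that text-specific approach: the snippet below assumes each chapter begins with a heading on its own line of the form "CHAPTER 1", "CHAPTER 2", and so on. That is an assumption about the file, so check a few lines of the raw text first and adjust the pattern if the headings look different. The names split_chapters and CHAPTER_HEADING are made up for this example; they are not part of NLTK.

```python
import re

# Assumption: a chapter heading is a line like "CHAPTER 1" or "CHAPTER XII".
# Adjust this pattern to match the headings in your own text.
CHAPTER_HEADING = re.compile(r"^CHAPTER[ \t]+\w+[ \t]*$", re.MULTILINE)


def split_chapters(raw):
    """Split a raw string on chapter headings and return the chapter bodies.

    Anything before the first heading (title, preface) is dropped.
    """
    parts = CHAPTER_HEADING.split(raw)
    return [part.strip() for part in parts[1:]]


# Small stand-in for the real book so the example runs on its own.
sample = (
    "[Title by Author]\n\n"
    "CHAPTER 1\n\nFirst chapter text.\n\n"
    "CHAPTER 2\n\nSecond chapter text.\n"
)
chapters = split_chapters(sample)
print(len(chapters))
```

For the real book you would pass in the string from nltk.corpus.gutenberg.raw('austen-sense.txt'), or the contents of your own .txt file read with open(...).read(). Each chapter is then a plain string that you can tokenize separately, for example with nltk.word_tokenize.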