Reading corpus text with nltk.corpus.reader.plaintext - Python 3

Asked Jan 04 '18 at 15:31

Active Jan 04 '18 at 16:32

Viewed 405 times

I am using the nltk.corpus module in Python (3.6.3) to build and analyze a corpus I have created. The corpus consists of several hundred documents. To access the content of a document in the corpus, I call the .raw() method, but this throws a decoding error.

fileids = newcorpus.fileids()  # newcorpus is the PlaintextCorpusReader object I have created

for f in fileids:
    if f not in normalized_docs:
        p = newcorpus.raw(f)

The error I am receiving is the following:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xeb in position 200: invalid continuation byte

What can I do to prevent this from happening?
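
For anyone trying to reproduce this: the same kind of error can be triggered without NLTK. The sketch below is only an illustration under an assumption, namely that the files were saved as Latin-1; the sample string is hypothetical and not taken from the corpus. In Latin-1 the character 'ë' is the single byte 0xEB, which UTF-8 treats as the start of a multi-byte sequence.

```python
# Hypothetical sample text; "\u00eb" is the character 'ë'.
text = "Citro\u00ebn builds cars."

# Encode as Latin-1: 'ë' becomes the single byte 0xEB.
data = text.encode("latin-1")

reason = None
try:
    # Decoding as UTF-8 fails because the byte after 0xEB is not a continuation byte.
    data.decode("utf-8")
except UnicodeDecodeError as err:
    reason = err.reason
    print(err)

# Decoding with the encoding the bytes were written in round-trips cleanly.
print(data.decode("latin-1"))
```

If the real files decode cleanly with "latin-1" but not with "utf-8", that suggests the encoding passed to the corpus reader is the thing to change.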

edited Jan 04 '18 at 16:32

asked Jan 04 '18 at 15:31

Roald Schuring

Check the encoding of your corpus. Is it latin-1 or UTF-8? – alvas Jan 04 '18 at 16:46
There's an encoding argument in PlaintextCorpusReader to specify the encoding of your documents; see the answer in https://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk – alvas Jan 04 '18 at 16:47

Thanks alvas, this is really helpful. The link you shared was exactly the information I needed. Changing the encoding argument to 'Latin-1' did the trick. – Roald Schuring Jan 04 '18 at 16:57
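
A minimal sketch of the fix described in the comments, assuming the documents are Latin-1 encoded. The temporary directory, the file name doc1.txt, and the sample text are hypothetical stand-ins for the real corpus; the encoding argument of PlaintextCorpusReader is the relevant part.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Hypothetical corpus: one Latin-1 encoded file in a temporary directory.
corpus_root = tempfile.mkdtemp()
sample = "Citro\u00ebn builds cars."
with open(os.path.join(corpus_root, "doc1.txt"), "w", encoding="latin-1") as fh:
    fh.write(sample)

# Pass the encoding the files were saved in; the reader's default is UTF-8.
newcorpus = PlaintextCorpusReader(corpus_root, r".*\.txt", encoding="latin-1")

for f in newcorpus.fileids():
    p = newcorpus.raw(f)  # decodes with latin-1, so no UnicodeDecodeError here
    print(f, len(p))
```

Note that Latin-1 maps every byte to some character, so it never raises a decoding error; if the files are actually in a different encoding (for example cp1252), the text will load but some characters may come out wrong.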
