I am using the nltk.corpus module in Python (3.6.3) to build and analyze a corpus I have created. The corpus consists of several hundred documents. To access the content of a document in the corpus, I call the .raw() method, but it throws a decoding error.

fileids = newcorpus.fileids()  # newcorpus is the PlaintextCorpusReader object I have created

for f in fileids:
    if f not in normalized_docs:
        p = newcorpus.raw(f)

The error I am receiving is the following:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xeb in position 200: invalid continuation byte

What can I do to prevent this from happening?

Roald Schuring
  • Check the encoding of your corpus. Is it Latin-1 or UTF-8? – alvas Jan 04 '18 at 16:46
  • PlaintextCorpusReader takes an encoding argument that specifies the encoding of your documents; see the answer in https://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk – alvas Jan 04 '18 at 16:47
  • Thanks alvas, this is really helpful. The link you shared was exactly the information I needed. Changing the encoding argument to 'Latin-1' did the trick. – Roald Schuring Jan 04 '18 at 16:57
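
The check alvas suggests can be sketched with the standard library alone, which is handy when the corpus has several hundred files and only some of them are not UTF-8. This is a rough illustration, not part of NLTK: the names find_undecodable and demo, the "*.txt" pattern, and the sample file contents are all made up for the example.

```python
import tempfile
from pathlib import Path


def find_undecodable(root, encoding="utf-8", pattern="*.txt"):
    """Return (file name, error message) pairs for files that fail to decode.

    Hypothetical helper: `root` is the corpus directory and `pattern`
    selects the documents to check.
    """
    failures = []
    for path in sorted(Path(root).glob(pattern)):
        try:
            # Decode the raw bytes strictly so bad files raise an error.
            path.read_bytes().decode(encoding)
        except UnicodeDecodeError as exc:
            failures.append((path.name, str(exc)))
    return failures


def demo():
    """Build a throwaway two-file corpus and check it (illustration only)."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "ok.txt").write_bytes(b"plain ascii text")
        # 0xEB is the Latin-1 byte for e-with-diaeresis; on its own it is
        # not valid UTF-8, which reproduces the error in the question.
        Path(tmp, "bad.txt").write_bytes(b"Citro\xebn")
        return find_undecodable(tmp)


if __name__ == "__main__":
    for name, message in demo():
        print(name, "->", message)
```

Any file this reports needs a different encoding, for example by passing encoding='latin-1' when constructing the PlaintextCorpusReader, as in the comments above. Keep in mind that Latin-1 can decode any byte sequence without raising, so the absence of an error does not by itself prove that Latin-1 is the correct encoding; it is worth inspecting a few documents by eye.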
