I am using the nltk.corpus module in Python (3.6.3) to build and analyze a corpus I have created. The corpus consists of several hundred documents. To access the content of a document in the corpus, I call the .raw() method, but this throws a decoding error.
fileids = newcorpus.fileids()  # newcorpus is the PlaintextCorpusReader object I have created
for f in fileids:
    if f not in normalized_docs:
        p = newcorpus.raw(f)
The error I am receiving is the following:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xeb in position 200: invalid continuation byte
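For reference, the same kind of error can be reproduced outside NLTK. This is only a minimal sketch, and it assumes something I have not confirmed: that the offending file is encoded as Latin-1 (ISO-8859-1), where the byte 0xeb stands for "ë". The sample string and the try_decode helper are made up for illustration and are not part of my corpus code.

```python
# Hypothetical sample text: "Zoë Smith" stored as Latin-1, so "ë" is the single byte 0xeb.
sample = "Zo\u00eb Smith".encode("latin-1")


def try_decode(data, encoding):
    """Return (text, None) on success, or (None, error message) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


# Decoding as UTF-8 fails: 0xeb announces a multi-byte sequence,
# but the byte that follows it (a space) is not a continuation byte.
utf8_text, utf8_error = try_decode(sample, "utf-8")

# Decoding with the encoding the bytes were written in succeeds.
latin1_text, latin1_error = try_decode(sample, "latin-1")

print(utf8_error)
print(latin1_text)
```

If that assumption holds for my files, the reader is probably decoding them with the wrong codec. As far as I can tell, PlaintextCorpusReader accepts an encoding argument in its constructor, so that may be the place to change it, but I do not know which encoding my documents actually use.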
What can I do to prevent this from happening?