I crawled PDF, HTML, and DOC files using Apache Tika and stored the extracted text in text files. These text files contain some unusual special characters, and because of them I am unable to read the files. I use the code snippet below to read them:

import codecs

fo = codecs.open('/var/www/testfiles/sample.txt','r','utf-8').read()

But I am getting the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xb7 in position 1291: invalid start byte

Please suggest how I can read my text files. Thanks.

user2609542

1 Answer

You'll need to set the 'errors' keyword parameter to something other than the default, 'strict'. The possible values (for Python 3.3) are enumerated in the "codecs.register()" documentation.

I'd start with the 'replace' option just so you can see what you're dealing with.
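
As a rough sketch of what that looks like, the snippet below writes a small temporary file containing a stray 0xb7 byte and reads it back with two non-strict error handlers. The sample bytes and the temporary file are made up for illustration; with the real data you would pass your own path instead.

```python
import codecs
import os
import tempfile

# Hypothetical sample data: ASCII text plus a lone 0xb7 byte,
# which cannot start a UTF-8 sequence.
raw = b"price \xb7 list"

# Write the bytes to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # 'replace' substitutes U+FFFD for each undecodable byte,
    # so you can see where the problem characters are.
    with codecs.open(path, "r", "utf-8", errors="replace") as f:
        replaced = f.read()

    # 'ignore' silently drops the undecodable bytes instead.
    with codecs.open(path, "r", "utf-8", errors="ignore") as f:
        ignored = f.read()
finally:
    os.remove(path)

print(repr(replaced))
print(repr(ignored))
```

If the replacement characters show up where you would expect ordinary punctuation, the file may not be UTF-8 at all. For instance, 0xb7 is the middle dot in Latin-1, so trying that encoding could be worthwhile, though nothing in the question confirms which encoding Tika actually wrote.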

Codie CodeMonkey