
I want to open a text file (.dat) in Python and I get the following error: `'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte`. The file is supposed to be UTF-8 encoded, so maybe there is some character in it that cannot be decoded. I am wondering, is there a way to handle the problem without dealing with each weird character individually? The text file is rather huge, and hunting down every byte that isn't valid UTF-8 would take me hours.

Here is my code:

import codecs
f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
f.close()
  • Yes, but I think that is just the first invalid (non-UTF-8) byte it encounters in the file, so I assume that once I correct for that one I will run into another one, and another one, and so on. As I specified in the question, the text file I want to read is rather huge. – StudentOIST Oct 17 '17 at 02:10
  • Side-note: If you're using Python 2.6 or higher, with a standard (bytes->text) codec, don't use `codecs.open`, use `io.open`, which is faster, and less buggy than `codecs.open`. `io.open` is actually the same as the built-in `open` on Python 3, but made available on Python 2 to ease writing Unicode friendly code and simplify porting to Py3. – ShadowRanger Oct 17 '17 at 02:16
  • Seems to me like "utf-8 non encoded characters" is an [oxymoron](http://www.dictionary.com/browse/oxymoron). To be utf-8 its contents would need to be encoded that way. Perhaps it's `'latin1'`. – martineau Oct 17 '17 at 02:30
  • A file that was encoded using UTF-8 _cannot_ contain invalid bytes unless it got corrupted, or the encoder is buggy. I guess either of those things is possible, but it's more likely that the file was not actually encoded as UTF-8, but as something else, e.g. cp1252. What makes you certain that it's UTF-8? – PM 2Ring Oct 17 '17 at 02:31
  • FWIW, in cp1252 `0x92` is the encoding of the apostrophe `’` – PM 2Ring Oct 17 '17 at 02:37
  • @PM2Ring: DAMN YOU SMART-QUOTES! :-) – ShadowRanger Oct 17 '17 at 02:49
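
As the last two comments suggest, a quick sanity check (a one-off sketch, not part of the original post) is to decode the offending byte with cp1252 and see what comes out:

# 0x92 is a valid cp1252 byte: it decodes to the right single quotation mark,
# the kind of "smart quote" a Windows editor inserts for an apostrophe.
print(b'\x92'.decode('cp1252'))   # prints ’ (U+2019)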

2 Answers


It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:

with open('compounds.dat', 'rb') as f:
    data = f.read()

the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
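
For example, continuing from the snippet above (the 20-byte window below is arbitrary, just enough to see some context):

print(data[4484])        # the offending byte; on Python 3 this is the int 146 (0x92)
print(data[4464:4504])   # 20 bytes on either side of the bad byte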

In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):

# If this is Py3, you don't even need the import, just use plain open which is
# an alias for io.open
import io

with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)

will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
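
A minimal sketch of the errors='replace' variant (same file name as above): every undecodable byte becomes the Unicode replacement character U+FFFD instead of disappearing silently, so you can still see where the garbage was:

import io

with io.open('compounds.dat', encoding='utf-8', errors='replace') as f:
    for line in f:
        if u'\ufffd' in line:  # lines that contained invalid bytes
            print(line)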

ShadowRanger

If you are working with huge data, it is better to pass the encoding explicitly, and if the error persists, add errors="ignore" as well:

with open("filename" , 'r'  , encoding="utf-8",errors="ignore") as f:
    f.read()
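
Since the concern here is huge data, a hedged sketch of the streaming variant (the file name is a placeholder): iterating over the handle, as in the answer above, avoids holding the whole file in memory while still skipping invalid bytes:

with open("filename", 'r', encoding="utf-8", errors="ignore") as f:
    for line in f:
        # process each decoded line here; invalid bytes have already been dropped
        print(line, end="")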
Shilpa Shinde