Python UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Question

I am getting this error on Ubuntu 18.04, using Python 3.6:

  File "/home/sw/miniconda3/envs/py36/lib/python3.6/codecs.py", line 644, in __next__
    line = self.readline()
  File "/home/sw/miniconda3/envs/py36/lib/python3.6/codecs.py", line 557, in readline
    data = self.read(readsize, firstline=True)
  File "/home/sw/miniconda3/envs/py36/lib/python3.6/codecs.py", line 503, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I have tried the utf-16 and latin1 encodings, but neither works. Any help is appreciated.

I assume you are trying to read text from a file. Why do you believe the file contains (meaningful) text? Where did the file come from, and what did the source claim about the encoding? What does the file look like when you view it in a hex editor? What happens if you try to open it in your system text editor? — Karl Knechtel, Sep 05 '20 at 00:40
@SieTw no, I mean, when you `open()` the file, what mode are you doing it in? The default is `'r'`, if you don't provide an argument, but for some files you want to open in mode `'rb'`, because they're not properly encoded for text. — Green Cloak Guy, Sep 05 '20 at 00:56
"i can read the file fine in the text editor" Okay, and what does it look like? What text encoding does the editor tell you it tried? What language is used for the text? — Karl Knechtel, Sep 05 '20 at 02:04
@RickJames I get the hex dump as: 0000000 feff 0027 0057 0065 0020 0063 0061 006e 0000010 006e 006f 0074 0020 0077 0061 0073 0074 — Sie Tw, Sep 05 '20 at 15:25
`feff` bytes mean https://en.wikipedia.org/wiki/Byte_order_mark#UTF-16 — JosefZ, Sep 05 '20 at 17:31
@RickJames Now, strangely, when I read in utf-16 inside a terminal, I can read the file, but the same code does not work in a file: https://superuser.com/questions/1583311/python-code-working-in-terminal-but-not-in-python-file — Sie Tw, Sep 06 '20 at 00:08

score 1 · Answer 1 · answered Sep 05 '20 at 21:09

UTF-16 / UCS-2 -- These are not useful encodings, except that the file might be coming from Java or maybe some Microsoft Office product. The first 2 bytes are the "BOM" (byte order mark), which you may have to manually step over.

The goal is to tell python/mysql/whoever that the file is encoded "utf-16" or "ucs2", depending on what is available to the language.
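For Python specifically, a minimal sketch of "telling Python the file is utf-16" might look like the following. The helper name `read_utf16_text` and the sample file are made up for illustration; the sample bytes mirror the start of the hex dump in the question's comments (BOM bytes FF FE, then little-endian text). With the plain `"utf-16"` codec, Python reads the BOM to pick the byte order and drops it, so no manual skipping should be needed in this case.

```python
import codecs
import os
import tempfile


def read_utf16_text(path):
    """Read a text file that is UTF-16 encoded and starts with a BOM."""
    # The "utf-16" codec uses the BOM to choose the byte order and
    # does not include the BOM in the returned text.
    with open(path, "r", encoding="utf-16") as handle:
        return handle.read()


# Hypothetical sample file: BOM (FF FE) followed by little-endian UTF-16.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_utf16.txt")
with open(sample_path, "wb") as handle:
    handle.write(codecs.BOM_UTF16_LE + "'We cannot".encode("utf-16-le"))

text = read_utf16_text(sample_path)
print(text)
```

Note that naming the byte order explicitly (`"utf-16-le"` or `"utf-16-be"`) behaves differently: those codecs treat the BOM as an ordinary character, so the decoded text would start with U+FEFF unless the first two bytes are skipped.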

Rick James


Could you take a look at https://superuser.com/questions/1583311/python-code-working-in-terminal-but-not-in-python-file Now I can read inside the terminal, but not in a file :( thank you! – Sie Tw Sep 06 '20 at 00:08
Is `"-le"` correct? – Rick James Sep 06 '20 at 00:38
When I read with utf-16-le, I get this error: UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 7070-7071: illegal UTF-16 surrogate
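One way to investigate a surrogate error like the one above is to read the raw bytes, pick the codec from the BOM, and decode with `errors="replace"` so that each undecodable unit shows up as U+FFFD instead of raising. This is only a diagnostic sketch: the function names and the sample bytes are invented here, and it does not explain why the real file contains a bad surrogate. Replacing characters also hides the damage rather than repairing it.

```python
import codecs


def guess_utf16_codec(raw):
    """Return an explicit UTF-16 codec name based on a leading BOM, or None."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


def decode_lenient(raw):
    """Decode BOM-prefixed UTF-16 bytes, marking bad units with U+FFFD."""
    codec = guess_utf16_codec(raw)
    if codec is None:
        raise ValueError("no UTF-16 BOM found")
    body = raw[2:]  # step over the 2-byte BOM manually
    return body.decode(codec, errors="replace")


# Invented sample: valid text, then a lone low surrogate (0xDC00), then "!".
lone_low_surrogate = b"\x00\xdc"
raw = (
    codecs.BOM_UTF16_LE
    + "ok".encode("utf-16-le")
    + lone_low_surrogate
    + "!".encode("utf-16-le")
)

try:
    raw[2:].decode("utf-16-le")  # strict decoding fails on the surrogate
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decode failed at byte offset", exc.start)

text = decode_lenient(raw)
print(repr(text))
```

Searching the lenient result for `"\ufffd"` gives a rough idea of where the file is damaged, which can then be checked against a hex dump.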
