Let's examine that error message very closely:
"UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-13: unsupported Unicode code range"
Note carefully that it says "bytes in position 8-13" -- that's a 6-byte UTF-8 sequence. That might have been valid in the dark ages, but since Unicode was frozen at 21 bits, the maximum is FOUR bytes. UTF-8 validations and error reporting were tightened up recently; as a matter of interest, exactly what version of Python are you running?
With 2.7.1 and 2.6.6 at least, that error becomes the more useful "... can't decode byte XXXX in position 8: invalid start byte", where XXXX can only be 0xfc or 0xfd if the old message suggested a 6-byte sequence. In ISO-8859-1 or cp1252, 0xfc represents U+00FC LATIN SMALL LETTER U WITH DIAERESIS (aka u-umlaut, a likely suspect); 0xfd represents U+00FD LATIN SMALL LETTER Y WITH ACUTE (less likely).
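A minimal sketch of that diagnosis. The byte string `sample` is made up for illustration (eight ASCII bytes followed by 0xfc); it is not the asker's data. It shows a current UTF-8 decoder rejecting 0xfc as a start byte, and the same byte decoding to u-umlaut under ISO-8859-1 and cp1252.

```python
import unicodedata

# Hypothetical sample: eight ASCII bytes, then 0xfc at position 8.
sample = b"12345678\xfcber"

try:
    sample.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The same byte is a perfectly good character in the legacy 8-bit encodings.
latin1_char = b"\xfc".decode("iso-8859-1")
cp1252_char = b"\xfc".decode("cp1252")
name_fc = unicodedata.name(latin1_char)
name_fd = unicodedata.name(b"\xfd".decode("iso-8859-1"))

if error is not None:
    print("%d: %s" % (error.start, error.reason))
print(name_fc)
print(name_fd)
```

If your own data fails at the same position with the same reason, that points towards input that is really ISO-8859-1 or cp1252 rather than UTF-8.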
The problem is NOT with the if line.startswith(u"Fußnote"): statement in your source file. You would have got a message at COMPILE time if the source file wasn't proper UTF-8, and the message would have started with "SyntaxError", not "UnicodeDecodeError". In any case the UTF-8 encoding of that string is only 8 bytes long, not 14.
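The byte count is easy to check. This small sketch writes the ß as the escape \xdf so that it does not depend on the encoding declaration of whatever file it is pasted into; the variable names are only illustrative.

```python
# u"Fu\xdfnote" is the same string as u"Fußnote", spelled with an escape.
prefix = u"Fu\xdfnote"
encoded = prefix.encode("utf-8")

print(len(prefix))    # number of characters
print(len(encoded))   # number of bytes in the UTF-8 encoding
print(repr(encoded))
```

The string has 7 characters and only the ß takes two bytes, so there is no way for the literal to have a problem at byte positions 8-13.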
The problem is (as @Mark Tolonen has pointed out) in whatever "line" is referring to. It can only be a str object.
To get further you need to answer Mark's questions: (1) the result of print repr(line) and (2) the site.py change.
At this stage it's a good idea to clear the air about mixing str and unicode objects (in many operations, not just a.startswith(b)).
Unless the operation is defined to produce a str object, Python will NOT coerce the unicode object to str. a.startswith(b) is not such an operation, so Python will instead attempt to decode the str object using the default (usually 'ascii') encoding.
Examples:
>>> "\xff".startswith(u"\xab")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>> u"\xff".startswith("\xab")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)
Furthermore, it is NOT correct to say "Mix and you get UnicodeDecodeError". It is quite possible that the str object is validly encoded in the default encoding (usually 'ascii'), in which case no exception is raised.
Examples:
>>> "abc".startswith(u"\xff")
False
>>> u"\xff".startswith("abc")
False
>>>
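One way to avoid the implicit decode altogether is to decode the bytes yourself before comparing. This is only a sketch: `raw_line` is a made-up line, and UTF-8 is an assumed encoding. Given the 0xfc byte discussed above, your real input may well be cp1252 or ISO-8859-1, in which case that name should go in the decode() call.

```python
# Hypothetical input: one line of bytes, assumed here to be UTF-8 encoded.
raw_line = b"Fu\xc3\x9fnote 1: siehe oben"

# Decode explicitly, so that both operands of startswith() are text
# and the default ('ascii') codec never gets involved.
text_line = raw_line.decode("utf-8")
is_footnote = text_line.startswith(u"Fu\xdfnote")

print(is_footnote)
```

If the chosen encoding is wrong, the explicit decode() fails at a place you control rather than inside startswith().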