Python 2 strings are byte strings, and UTF-8 encoded text uses multiple bytes per character. Just because your terminal manages to interpret the UTF-8 bytes as characters, doesn't mean that Python knows about what bytes form one UTF-8 character.
Your bytestring consists of 6 bytes, every two bytes form one character:
>>> a = "čšž"
>>> a
'\xc4\x8d\xc5\xa1\xc5\xbe'
However, how many bytes UTF-8 uses depends on where in the Unicode standard the character is defined; ASCII characters (the first 128 characters in the Unicode standard) only need 1 byte each, and many emoji need 4 bytes!
In UTF-8 order is everything; reversing the above bytestring reverses the bytes, resulting in some gibberish as far as the UTF-8 standard is concerned, but the middle 4 bytes just happen to be valid UTF-8 sequences (for š
and ō
):
>>> a[::-1]
'\xbe\xc5\xa1\xc5\x8d\xc4'
-----~~~~~~~~^^^^^^^^####
| š ō |
\ \
invalid UTF8 byte opening UTF-8 byte missing a second byte
You'd have to decode the byte string to a unicode
object, which consists of single characters. Reversing that object gives you the right results:
b = a.decode('utf8')[::-1]
print b
You can always encode the object back to UTF-8 again:
b = a.decode('utf8')[::-1].encode('utf8')
Note that in Unicode, you can still run into issues when reversing text, when combining characters are used. Reversing text with combining characters places those combining characters in front rather than after the character they combine with, so they'll combine with the wrong character instead:
>>> print u'e\u0301a'
éa
>>> print u'e\u0301a'[::-1]
áe
You can mostly avoid this by converting the Unicode data to its normalised form (which replaces combinations with 1-codepoint forms) but there are plenty of other exotic Unicode characters that don't play well with string reversals.