Python UTF-8 behaviour

Question

Possible Duplicate:
Python returning the wrong length of string when using special characters

I read a multilingual string from a file encoded in windows-1251, for example s="qwe абв" (the second word is Russian), and then run:

for i in s.decode('windows-1251').encode('utf-8').split():
  print i, len(i)

and I get:

qwe 3
абв 6

Oh God, why? o_O

Andrew Clark · Accepted Answer · 2012-10-08T16:16:55.737

3

In many programming languages you can't always think of a string as a sequence of characters, because it is often actually a sequence of bytes (this is the case for Python 2's str type). Not every character or symbol fits in 8 bits, so character encodings define rules for combining multiple bytes into a single character.

In the case of the string 'абв' encoded in utf-8, each Cyrillic letter takes 2 bytes, so what you have is 6 bytes that represent 3 characters. If you want to count the number of characters instead of the number of bytes, make sure you take the length of a unicode string, for example the result of s.decode('windows-1251') before it is re-encoded.
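To make the difference concrete, here is a minimal sketch that compares both counts. It assumes the sample text from the question, builds the windows-1251 bytes in memory instead of reading a file, and uses hypothetical variable names (raw, text, utf8). It is written so that it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Stand-in for the bytes read from the windows-1251 file (sample data from the question).
raw = u"qwe \u0430\u0431\u0432".encode("windows-1251")

text = raw.decode("windows-1251")  # unicode text: one item per character
utf8 = text.encode("utf-8")        # bytes: each Cyrillic letter takes 2 bytes

# Length of each word when measured on decoded text versus on UTF-8 bytes.
char_counts = [len(word) for word in text.split()]
byte_counts = [len(word) for word in utf8.split()]

print(char_counts)  # [3, 3]
print(byte_counts)  # [3, 6]
```

Splitting and measuring the decoded text gives the character counts the question expected; measuring after .encode('utf-8') gives byte counts.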

edited Oct 08 '12 at 16:16

answered Oct 07 '12 at 06:21


I guessed something like this... thanks. – scythargon Oct 07 '12 at 06:26
This is the correct answer for your question of 'why' this happens -- if you're interested in a way to achieve what you perhaps expected (i.e. to be able to count characters), use the codecs module to open the file you're reading in ... this will coerce it to unicode while reading, and with the native unicode strings the len() method will return the number of characters. – jlmcdonald Oct 07 '12 at 06:48
@jlmcdonald or just don't reencode to utf-8 - `s.decode('windows-1251')` gives a unicode string. – lvc Oct 07 '12 at 06:52
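The suggestion in the comments to decode while reading could look roughly like the sketch below. It swaps in io.open rather than the codecs module mentioned above, on the assumption that the same code should run on Python 2.6+ and Python 3; the temporary file and its name are hypothetical stand-ins for the real input file.

```python
import io
import os
import tempfile

# Create a small windows-1251 file (hypothetical sample standing in for the real input).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="windows-1251") as f:
    f.write(u"qwe \u0430\u0431\u0432")

# Opening with an explicit encoding returns decoded text, so len() counts characters.
with io.open(path, "r", encoding="windows-1251") as f:
    words = f.read().split()

lengths = [len(word) for word in words]
print(lengths)  # [3, 3]
```

With this approach no manual .decode() call is needed, and the text should only be encoded again at the point where it is written out.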

score 2 · Answer 2 · answered Oct 07 '12 at 06:34


>>> print "абв"
абв
>>> print [char for char in "абв"]
['\xd0', '\xb0', '\xd0', '\xb1', '\xd0', '\xb2']

That's why :)


Anuj Gupta

