7

I have a Korean string encoded as Unicode like u'정정'. How do I know how many bytes are needed to represent this string?

I need to know the exact byte count since I'm using the string for iOS push notification and it has a limit on the size of the payload.

len('정정') doesn't work because that returns the number of characters, not the number of bytes.

dda
jasondinh

3 Answers

14

You first need to decide which encoding you want to measure the byte size in:

>>> print u'\uC815\uC815'
정정
>>> print len(u'\uC815\uC815')
2
>>> print len(u'\uC815\uC815'.encode('UTF-8'))
6
>>> print len(u'\uC815\uC815'.encode('UTF-16-LE'))
4
>>> print len(u'\uC815\uC815'.encode('UTF-16'))
6
>>> print len(u'\uC815\uC815'.encode('UTF-32-LE'))
8
>>> print len(u'\uC815\uC815'.encode('UTF-32'))
12
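The gap between the endian-specific and the generic codecs in the session above comes from the byte order mark (BOM): Python's plain UTF-16 and UTF-32 codecs prepend one when encoding, while the -LE and -BE variants do not. A minimal sketch that makes the overhead visible (the variable names are illustrative only, and the snippet is written to run on Python 2.7 and Python 3 alike):

```python
import codecs

s = u'\uC815\uC815'

utf16_plain = s.encode('UTF-16')     # BOM followed by the text, native byte order
utf16_le = s.encode('UTF-16-LE')     # text only, no BOM

# The generic codec output starts with a 2-byte BOM in the platform's byte order.
has_bom = utf16_plain[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Extra bytes contributed by the BOM for each encoding family.
bom16_overhead = len(utf16_plain) - len(utf16_le)
bom32_overhead = len(s.encode('UTF-32')) - len(s.encode('UTF-32-LE'))
```

For a size limit, only the encoding actually sent on the wire matters. If the payload goes out as UTF-8 (an assumption here), Python's UTF-8 codec adds no BOM, so the 6-byte figure is the one to count.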

You really want to review the Python Unicode HOWTO to fully appreciate the difference between a unicode object and its byte encoding.

Another excellent article is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky (one of the people behind Stack Overflow).

Martijn Pieters
  • How did you know this char is '\uC815'? What encoding is this? I did try utf-8/16/32 and none of them is correct, but '\uC815' seems to be working. – jasondinh Aug 06 '12 at 17:21
  • I have an application called UnicodeChecker that I use for reference, but `C815` is the Unicode code point. If you know the UTF-8 or UTF-16 byte sequence, you can *decode* from that to get the Unicode character (`'\xEC\xA0\x95'.decode('UTF-8')`). The Python prompt is helpful here; Python will use its `unicode_escape` encoding when echoing (not printing) unicode values to the terminal, for example. – Martijn Pieters Aug 06 '12 at 17:28
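
The code point lookup and the decode step described in the comments above can be sketched as follows. This is only an illustration with made-up variable names; it uses `b'...'` and `u'...'` literals so that it runs on both Python 2.7 and Python 3.

```python
ch = u'\uC815'                      # one Korean syllable, written as an escape

# ord() gives the code point as an integer; hex() shows it as it appears in \uXXXX.
code_point = hex(ord(ch))           # '0xc815'

# Encoding turns the code point into bytes; UTF-8 needs three for this character.
utf8_bytes = ch.encode('UTF-8')

# Decoding the same three bytes recovers the original character.
round_trip = b'\xEC\xA0\x95'.decode('UTF-8')
```

So `\uC815` is not an encoding at all: it is the code point itself, and UTF-8, UTF-16, and UTF-32 are different ways of turning that number into bytes.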
4

The number of bytes required to represent a Unicode string varies depending on the encoding you use.

>>> s = u'정정'
>>> len(s)
2
>>> len(s.encode('UTF-8'))
6
>>> len(s.encode('UTF-16'))
6
>>> len(s.encode('UTF-32'))
12

If you're going to reuse the encoding result, I recommend encoding it once, then pulling its len and reusing the already-encoded result later.
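
A minimal sketch of that encode-once pattern, assuming the payload is sent as UTF-8. The helper names `fits_limit` and `truncate_utf8` and the `limit` values are hypothetical; the real limit should come from Apple's documentation. The code runs on Python 2.7 and Python 3.

```python
def fits_limit(text, limit):
    """Encode text as UTF-8 once; return the bytes and whether they fit in limit."""
    encoded = text.encode('UTF-8')
    return encoded, len(encoded) <= limit


def truncate_utf8(text, limit):
    """Return text as UTF-8 bytes, cut to at most limit bytes on a character boundary."""
    encoded = text.encode('UTF-8')
    if len(encoded) <= limit:
        return encoded
    # Slicing bytes may split a multi-byte character; decoding with 'ignore'
    # drops the incomplete tail, and re-encoding yields whole characters only.
    return encoded[:limit].decode('UTF-8', 'ignore').encode('UTF-8')


message = u'\uC815\uC815'
encoded, ok = fits_limit(message, limit=5)   # hypothetical 5-byte budget
shortened = truncate_utf8(message, 5)
```

Here `encoded` can be handed straight to whatever sends the notification, so the string is never encoded twice. Truncating on the byte string rather than on the unicode object keeps the result within the byte budget without leaving half a character at the end.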

Mattie
0

Make sure you are using the correct standard encoding.

If you're not, you can always call decodedString = myString.decode('UTF-8') (substituting the correct encoding name, which you can find from the previous link, if it is not UTF-8) to get the string in a format where len(decodedString) should return the proper number.
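
A short sketch of what decoding does and does not tell you, assuming the byte string (named `raw` here purely for illustration) really is UTF-8. The length of the decoded value counts characters, so a byte count still has to be taken from the encoded form. The snippet runs on Python 2.7 and Python 3.

```python
raw = b'\xec\xa0\x95\xec\xa0\x95'    # UTF-8 bytes, e.g. as read from a file or socket

decoded = raw.decode('UTF-8')        # now a unicode object
char_count = len(decoded)            # number of characters

byte_count = len(decoded.encode('UTF-8'))   # number of bytes in UTF-8
```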

Hans Z