6

How would I get the character count of the string below in Python?

s = 'הוא אוסף אתכם מחר בשלוש וחצי.'

Character count: 29
Byte length: 52

len(s) = 52
? = 29
smci
David542

2 Answers

7

Decode your byte string (according to whatever encoding it's in, UTF-8 maybe) -- the len of the resulting Unicode string is what you're after.

In fact, best practice is to decode inputs as soon as possible, deal only with actual text in your code (i.e., unicode in Python 2; it's just the way ordinary strings are in Python 3), and, if need be, encode just as you're outputting again.

Byte strings should be handled in your program only if it's specifically about byte strings (e.g., controlling or monitoring some hardware device, etc.) -- far more programs are about text, and thus, except where indispensable at some I/O boundaries, they should be exclusively dealing with text strings (spelled unicode in Python 2:-).
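As a rough illustration of that decode-early, encode-late pattern, here is a minimal Python 3 sketch. The helper name `copy_trimmed` and the use of `io.BytesIO` to stand in for a real input/output boundary are my own assumptions for the example, not something from the question.

```python
import io

def copy_trimmed(src, dst, encoding='utf-8'):
    """Hypothetical helper: decode on input, work on text, encode on output."""
    text = src.read().decode(encoding)    # bytes -> text at the input boundary
    trimmed = text.strip()                # all processing happens on text
    dst.write(trimmed.encode(encoding))   # text -> bytes at the output boundary
    return len(trimmed)                   # character count, not byte count

# In-memory byte streams stand in for a file, socket, etc.
src = io.BytesIO('  שלום  '.encode('utf-8'))
dst = io.BytesIO()
n = copy_trimmed(src, dst)
print(n)                    # 4 characters
print(len(dst.getvalue()))  # 8 bytes once encoded as UTF-8
```

Only the first and last lines of the helper touch bytes; everything in between sees text, so `len` means characters there.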

But if you do want to keep s as a bytestring nevertheless,

len(s.decode('utf-8'))

(or whatever other encoding you're using to represent text as byte strings) should still do what you request.
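For readers on Python 3, where the roles are spelled `bytes` and `str`, the same idea might look like the sketch below. It assumes the text is encoded as UTF-8, which matches the 52-byte length reported in the question; the variable names are just for illustration.

```python
# Python 3 sketch: str holds text, bytes holds encoded data.
text = 'הוא אוסף אתכם מחר בשלוש וחצי.'

# Roughly what a plain Python 2 string literal holds in a UTF-8 source file.
data = text.encode('utf-8')

byte_length = len(data)                  # counts bytes
char_count = len(data.decode('utf-8'))   # counts code points after decoding

print(byte_length)  # 52
print(char_count)   # 29
```

Each Hebrew letter takes two bytes in UTF-8 while the spaces and the period take one each, which is where the gap between 52 and 29 comes from.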

Alex Martelli
3

Use a unicode string (the `u` prefix in Python 2):

    s = 'הוא אוסף אתכם מחר בשלוש וחצי.'
    len(s)  # 52 (byte string: counts bytes)
    s = u'הוא אוסף אתכם מחר בשלוש וחצי.'
    len(s)  # 29 (unicode string: counts characters)
Bjorn
  • I get `Unsupported characters in input`. – Malik Brahimi Jan 26 '15 at 22:44
  • Maybe a Python 2 thing? – Malik Brahimi Jan 26 '15 at 22:46
  • @MalikBrahimi and Bjorn, so the two of you are using different encodings in your sources or interactive interpreter or IDE or whatever -- neither is on the Python 2 default encoding of `ascii`, clearly, so you might want to check:-) – Alex Martelli Jan 26 '15 at 22:47
  • I'm using IDLE; is there a way I can change the encoding? – Malik Brahimi Jan 26 '15 at 23:00
  • @MalikBrahimi: See [_Where does this come from: -*- coding: utf-8 -*-_](http://stackoverflow.com/questions/4872007/where-does-this-come-from-coding-utf-8) and try adding it to the top of your script. I'm not sure if IDLE honors this or not...but it should. – martineau Jan 27 '15 at 02:11
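
For reference, a script with the encoding declaration from the last comment might look like the sketch below. It assumes the file itself is saved as UTF-8; whether a given editor or IDLE version honors the declaration is not something this example can confirm.

```python
# -*- coding: utf-8 -*-
# The declaration above must sit on the first or second line of the file (PEP 263).
# Python 3 already assumes UTF-8 source, so there the line is optional.

s = u'הוא אוסף אתכם מחר בשלוש וחצי.'  # the u prefix is also accepted by Python 3.3+
char_count = len(s)
print(char_count)  # 29
```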