
Given a character like "✮" (\xe2\x9c\xae), or others such as "Σ", "д" or "Λ", I want to find the "actual" length that character takes when printed onscreen.

For example:

len("✮")
len("\xe2\x9c\xae")

Both return 3, but the result should be 1.

smci
user3584604
  • Try: `len("✮".decode("utf-8"))` – Grijesh Chauhan Apr 29 '14 at 12:49
  • Won't that depend on the font used and also what characters surround it - what is the overall thing you are trying to do? – mmmmmm Apr 29 '14 at 12:51
  • `len("\xe2\x9c\xae".decode('UTF-8'))` works perfectly in python2.7.5. – Cthulhu Apr 29 '14 at 13:01
  • There are several ways to define length (and width) here. It would help to know what you want this for: for instance, are you trying to work out how many characters will fit in a row on the screen? – deltab Apr 29 '14 at 14:55

2 Answers


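You could try this:

import unicodedata
unicodedata.normalize('NFC', u'✮')  # normalize to composed form (NFC)
len(u"✮")                           # length of the unicode string, not the byte string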

UTF-8 is a Unicode encoding that uses more than one byte for each non-ASCII character, which is why len() on the byte string returns 3. See unicodedata.normalize().
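To make the byte versus code point distinction concrete, here is a minimal sketch. It assumes Python 3, where `str` holds code points and `bytes` holds the encoded form (the question itself uses Python 2, where a plain string literal is a byte string). The variable names are illustrative only.

```python
import unicodedata

star = "\u272e"                 # U+272E, the star character from the question
encoded = star.encode("utf-8")  # the three bytes \xe2\x9c\xae

print(len(encoded))  # counts UTF-8 bytes
print(len(star))     # counts code points

# NFC only changes the length when a precomposed form exists,
# for example "e" followed by a combining acute accent.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
```

For the star character normalization changes nothing, so the useful step is decoding the bytes into a text string before calling `len()`.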

Rahul Tripathi
  • Even this doesn't necessarily count user-perceived characters or grapheme clusters; some uses of diacritics don't have a single-code-point representation. I also don't see how UTF-8 (specifically) enters the picture? –  Apr 29 '14 at 09:28
  • This also returns 3: `len(unicodedata.normalize('NFC', u'✮'))` = 3 – user3584604 Apr 29 '14 at 09:40
  • Even without diacritics, some code points map to no glyph at all (think about control characters, word joiners, soft hyphens and so on). No amount of normalization will get you rid of these. (Back on topic: `u'✮'` is already in normal form so normalization is a no-op here; the OP’s actual problem was with the UTF-8 encoding being multibyte; hopefully as of 2022 we are all using Python 3 and `len()` correctly counts code points, rather than bytes.) – Maëlan Jun 06 '22 at 00:28

My answer to a similar question:

You are looking for the rendering width from the current output context. For graphical UIs, there is usually a method to directly query this information; for text environments, all you can do is guess what a conformant rendering engine would probably do, and hope that the actual engine matches your expectations.
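For a monospaced text environment, that guess can be sketched with the standard `unicodedata` module. This is a heuristic under stated assumptions, not a definitive implementation: the helper name `estimated_columns` is hypothetical, it assumes Python 3, it treats East Asian wide and fullwidth characters as two cells, and it assumes combining marks and control or format characters take no cell. A real terminal may disagree, especially for ambiguous-width characters.

```python
import unicodedata

def estimated_columns(text):
    """Guess how many terminal cells `text` occupies (heuristic only)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous character
        if unicodedata.category(ch) in ("Cc", "Cf", "Mn", "Me"):
            continue  # assumption: control, format and mark characters use no cell
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters usually fill two cells
        else:
            width += 1  # everything else, including ambiguous width, counts as one
    return width

print(estimated_columns("\u272e"))        # the star from the question
print(estimated_columns("\u65e5\u672c"))  # two wide CJK characters
print(estimated_columns("e\u0301"))       # base letter plus combining accent
```

Counting ambiguous-width characters as one cell is a design choice that matches many Western terminal setups; an East Asian locale may render them two cells wide.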

Simon Richter
  • Rendering width in pixels is another topic. I can't see that this has been asked. – Thomas Weller Apr 29 '14 at 14:24
  • For monospaced text output, the standard glyph width is the smallest addressable unit, and we are interested in multiples of that unit -- that is not so different from pixel width. – Simon Richter Apr 29 '14 at 14:26