Python 3 len() function for Unicode characters

Question

Python 3 is widely believed to have gotten Unicode right, so I was surprised when I ran into this situation.

>>> amma = "அம்மா"
>>> amma
'அம்மா'
>>> len(amma)
5

The Tamil string "அம்மா" has 3 letters, so a return value of 5 from len("அம்மா") is not what I would expect or find useful.

How do other Dravidian or Brahmic scripts solve this issue to get the right string length?

Edit #1: Following @Joey's comment, this question can be rephrased as follows.

How to calculate the grapheme length in Python?

We know that Swift and Perl 6 do this by default:

  2> let amma = "அம்மா".characters.count
amma: Distance = 3

The [grapheme](https://pypi.org/project/grapheme/) package on PyPI seems to do what you want. I don't believe there's an easy solution using only the tools in the standard library (though the unicodedata module's tools might be useful, depending on your needs). – snakecharmerb, Jul 25 '20 at 15:43

score 2 · Answer 1 · answered Jan 27 '16 at 10:23


It may have 3 letters, but it has 5 characters:

$ charinfo 'அம்மா'
U+0B85 TAMIL LETTER A [Lo]
U+0BAE TAMIL LETTER MA [Lo]
U+0BCD TAMIL SIGN VIRAMA [Mn]
U+0BAE TAMIL LETTER MA [Lo]
U+0BBE TAMIL VOWEL SIGN AA [Mc]

If you need something more specific, count only the characters whose Unicode general category is Letter (the categories starting with "L"); for this string that gives 3.
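A minimal sketch of that idea using only the standard library's unicodedata module. The helper names `describe` and `count_letters` are made up for illustration. Note that this counts letters only, so digits, punctuation, and emoji are excluded, which may or may not be what you want.

```python
import unicodedata


def describe(text):
    """Return (code point, name, general category) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]


def count_letters(text):
    """Count code points whose general category starts with 'L' (Letter)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("L"))


amma = "அம்மா"
for row in describe(amma):
    print(*row)  # one line per code point, similar to the charinfo listing

print(len(amma))            # 5 code points
print(count_letters(amma))  # 3 letters
```

The virama (Mn) and the vowel sign (Mc) are marks rather than letters, which is why they drop out of the count.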

answered Jan 27 '16 at 10:23

Ignacio Vazquez-Abrams



More precisely: 3 graphemes, but 5 code points. Counting graphemes in a string seems to be a bit complicated in Python, though (can't find any good samples). – Joey Jan 27 '16 at 10:25
@Joey Thou soundeth well informed. This is driving me nuts now :( – nehem Jan 27 '16 at 10:30
You can use [regex](https://pypi.python.org/pypi/regex) to strip out anything you don't want, but the hard part is figuring out what you do want in the first place. – Ignacio Vazquez-Abrams Jan 27 '16 at 10:34

score -2 · Answer 2 · answered Jul 24 '20 at 07:50


Package

pip install Open-Tamil

Code

from tamil import utf8
amma = "அம்மா"
letters = utf8.get_letters(amma)
print(len(letters))

answered Jul 24 '20 at 07:50

Smart Manoj


score -2 · Answer 3 · edited Aug 25 '20 at 06:21


The code below counts only the base Tamil letters and ignores Unicode combining marks (using the standard re module).

import re
amma = "அம்மா"
len(re.findall("[ஃ-ஹ]", amma))

The code below is a more general way to count letters in any script: each match is one letter followed by any combining marks attached to it (using the third-party regex module).

import regex
amma = "அம்மா"
len(regex.findall(r'\p{L}\p{M}*', amma))  # raw string, so \p is not treated as an invalid escape sequence
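If installing a third-party module is not an option, the same "base character plus trailing marks" grouping can be approximated with the standard library. This is only a sketch: `approx_clusters` is a hypothetical helper, it groups any non-mark character (not just letters) with the marks that follow it, and it does not implement the full Unicode grapheme cluster rules (UAX #29), so it will miscount things like emoji sequences or Hangul jamo. For full rules, the regex module's `\X` or the grapheme package mentioned in the comments are better starting points.

```python
import unicodedata


def approx_clusters(text):
    """Group each non-mark code point with the combining marks that follow it.

    Approximation only: not a full UAX #29 grapheme cluster segmentation.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch   # attach the mark to the preceding base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


amma = "அம்மா"
clusters = approx_clusters(amma)
print(clusters)
print(len(clusters))  # 3
```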

edited Aug 25 '20 at 06:21

wovano


answered Jul 24 '20 at 12:51

Neechalkaran

