Just when we believed Python 3 got everything right with Unicode, I was surprised to run into this situation.

>>> amma = "அம்மா"
>>> amma
'அம்மா'
>>> len(amma)
5

The Tamil string "அம்மா" has 3 letters, so a return value of 5 for len("அம்மா") is hard to accept.

How do other Dravidian or Brahmic scripts solve this issue to get the right string length?

Edit #1: Considering @Joey's comment, this question can be rephrased as follows.

How to calculate the grapheme length in Python?

We know Swift and Perl 6 do this by default:

  2> let amma = "அம்மா".characters.count
amma: Distance = 3
nehem
  • @Mijago: Nope, it won't. – Joey Jan 27 '16 at 10:26
  • The [grapheme](https://pypi.org/project/grapheme/) package on PyPI seems to do what you want. I don't believe there's an easy solution using only the tools in the standard library (though the unicodedata module's tools might be useful, depending on your needs). – snakecharmerb Jul 25 '20 at 15:43

3 Answers

It may have 3 letters, but it has 5 characters:

$ charinfo 'அம்மா'
U+0B85 TAMIL LETTER A [Lo]
U+0BAE TAMIL LETTER MA [Lo]
U+0BCD TAMIL SIGN VIRAMA [Mn]
U+0BAE TAMIL LETTER MA [Lo]
U+0BBE TAMIL VOWEL SIGN AA [Mc]

If you need to be more specific, you will need to count only the characters that are in the Letter category.
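
As a rough sketch of that idea using only the standard library: unicodedata.category() returns a two-letter general category, and every Letter category starts with "L". The helper name count_letters is made up for illustration. Note this counts letters, not graphemes, so it may not match what a reader perceives as one character in every script.

```python
import unicodedata

def count_letters(text):
    """Count code points whose general category is a Letter (Lu, Ll, Lt, Lm, Lo)."""
    # count_letters is a hypothetical helper name, not part of any library.
    return sum(1 for ch in text if unicodedata.category(ch).startswith("L"))

amma = "\u0b85\u0bae\u0bcd\u0bae\u0bbe"  # the same five code points listed above
print(len(amma))            # 5 code points
print(count_letters(amma))  # 3 letters; the virama (Mn) and vowel sign (Mc) are skipped
```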

Ignacio Vazquez-Abrams
  • More precisely: 3 graphemes, but 5 code points. Counting graphemes in a string seems to be a bit complicated in Python, though (can't find any good samples). – Joey Jan 27 '16 at 10:25
  • @Joey Thou soundeth well informed. This is driving me nuts now :( – nehem Jan 27 '16 at 10:30
  • You can use [regex](https://pypi.python.org/pypi/regex) to strip out anything you don't want, but the hard part is figuring out what you do want in the first place. – Ignacio Vazquez-Abrams Jan 27 '16 at 10:34

Package

pip install Open-Tamil

Code

from tamil import utf8
amma = "அம்மா"
letters = utf8.get_letters(amma)
print(len(letters))
Smart Manoj

The code below counts only the base letters and ignores Unicode marks (using the standard re module).

import re
amma = "அம்மா"
len(re.findall("[ஃ-ஹ]", amma))

A more general way to count letters in Unicode text is to match each letter together with the marks that follow it (using the third-party regex module).

import regex
amma = "அம்மா"
# Raw string so that \p is passed to regex unchanged
len(regex.findall(r'\p{L}\p{M}*', amma))
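
If installing a third-party module is not an option, roughly the same grouping can be sketched with the standard unicodedata module by attaching every combining mark (categories Mn, Mc, Me) to the character before it. The name split_clusters is hypothetical, and this is only an approximation: it does not implement the full Unicode grapheme cluster rules, and unlike the pattern above it also treats non-letters as their own clusters.

```python
import unicodedata

def split_clusters(text):
    """Group each character with the combining marks that follow it."""
    # split_clusters is a hypothetical helper, not a library function.
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch   # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

amma = "\u0b85\u0bae\u0bcd\u0bae\u0bbe"  # same string as in the question
print(len(split_clusters(amma)))  # 3
```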
wovano
Neechalkaran