4

I am given a file containing a string of Hebrew characters (plus some Arabic ones; I can read neither language):

צוֹר‎

When I load this string from the file in Python 3:

fin = open("filename")
x = next(fin).strip()

the length of x appears to be 5:

>>> len(x)
5

Its UTF-8 encoding is:

>>> x.encode("utf-8")
b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'

However, when rendered in a browser, the string clearly appears to be 3 characters long.

How do I get the length properly? And why does this happen?

I am aware that Python 3 strings are Unicode by default, so I did not expect an issue like this.

Alan Kavanagh
  • 9,425
  • 7
  • 41
  • 65
Yo Hsiao
  • 678
  • 7
  • 12
  • "it is clear that the length of these Hebrew characters is 3" — It is clear that the computer disagrees with you, can you explain your position? – Josh Lee Dec 18 '17 at 02:03
  • `len(re.findall('\w', x))` – skrubber Dec 18 '17 at 02:05
  • 1
    I don't know how many characters are there -- I don't read Hebrew. But I do know that there are 5 unicode code points there. Try this in Python3: `for ch in 'צוֹר‎': print(unicodedata.name(ch))` – Robᵩ Dec 18 '17 at 02:06
  • Related: https://stackoverflow.com/questions/2247205/python-returning-the-wrong-length-of-string-when-using-special-characters – Robᵩ Dec 18 '17 at 02:07
  • 1
    Consider also breaking the text into _grapheme clusters_ https://pypi.python.org/pypi/uniseg – Josh Lee Dec 18 '17 at 02:17
  • @joshlee I use the mouse cursor to select/highlight, and the perceived number of characters is 3. – Yo Hsiao Dec 18 '17 at 02:24

4 Answers

6

The reason is that the text contains the control character \u200e, an invisible left-to-right mark (often used when multiple scripts are mixed, to mark the boundary between left-to-right and right-to-left runs). Additionally, it includes a vowel point (the little dot above the second letter), a combining "character" that indicates pronunciation.

If you replace the LTR mark with the empty string, for instance, you get a length of 4:

>>> x = 'צוֹר‎'
>>> x
'צוֹר\u200e'  # note the control character escape sequence
>>> print(len(x))
5

>>> print(len(x.replace('\u200e', '')))
4

If you want to count strictly alphabetic and whitespace characters only, you could use re.sub to remove all non-space, non-word characters:

>>> import re
>>> print(len(re.sub(r'[^\w\s]', '', x)))
3
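As a stdlib-only sketch of a count that matches what you see on screen, you can skip combining marks and format controls with `unicodedata` (for full grapheme-cluster segmentation you would want a third-party library such as `regex` or `uniseg`; this simpler filter happens to give the same answer here):

```python
import unicodedata

s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")

def visible_length(text):
    """Count code points that stand on their own visually.

    Skips combining marks (combining class != 0, which attach to the
    previous character) and format controls (category Cf, such as the
    left-to-right mark).
    """
    return sum(
        1 for c in text
        if unicodedata.combining(c) == 0
        and unicodedata.category(c) != "Cf"
    )

print(visible_length(s))  # 3
```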
lemonhead
  • 5,328
  • 1
  • 13
  • 25
  • Nice answer! A follow-up question: if I have `x = "צוֹר abc (123)"` and I want to use index to access the 123, how could I do it? Naively 'a' is at 4, and '1' is at 9. The substitution you suggested removes the punctuation as well. – Yo Hsiao Dec 18 '17 at 02:18
  • 1
    Hmm, well it depends what you are looking to do. The "correct" indices for the raw text would be 6 and 9 due to the control and accent characters. If you want a version of the text which explicitly excludes non-spacing marks and control characters only, you could do something like (borrowing from @MichaelButscher's answer): `''.join(c for c in x if unicodedata.category(c) not in ['Mn', 'Cf'])` – lemonhead Dec 18 '17 at 02:24
  • Correction: should be indices 6 and 11 above ^ – lemonhead Dec 18 '17 at 02:30
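A sketch of the index mapping discussed in these comments: build a list of the raw positions of the "real" code points, then index into that list (the sample string is the one from the comments, written with explicit escapes):

```python
import unicodedata

x = '\u05e6\u05d5\u05b9\u05e8\u200e abc (123)'  # "צוֹר‎ abc (123)"

# Raw index of every code point that is not a nonspacing mark (Mn)
# or a format control (Cf).
real_positions = [
    i for i, c in enumerate(x)
    if unicodedata.category(c) not in ("Mn", "Cf")
]

# 'a' is the 5th real character (index 4) and '1' the 10th (index 9);
# their raw indices are 6 and 11, matching the correction above.
print(real_positions[4], x[real_positions[4]])  # 6 a
print(real_positions[9], x[real_positions[9]])  # 11 1
```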
4

Unicode characters have different categories. In your case:

>>> import unicodedata
>>> s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
>>> list(unicodedata.category(c) for c in s)
['Lo', 'Lo', 'Mn', 'Lo', 'Cf']
  • Lo: Letter, other (not uppercase, lowercase or similar). These are the "real" characters
  • Mn: Mark, nonspacing. A vowel/accent mark combined with the previous character
  • Cf: Control, format. Here it switches back to left-to-right writing direction
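To see this per code point, a small sketch pairing the category with `unicodedata.name` (the names come from the Unicode character database):

```python
import unicodedata

s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
for c in s:
    # Code point, general category, and official Unicode name
    print(f"U+{ord(c):04X}  {unicodedata.category(c)}  {unicodedata.name(c)}")

# U+05E6  Lo  HEBREW LETTER TSADI
# U+05D5  Lo  HEBREW LETTER VAV
# U+05B9  Mn  HEBREW POINT HOLAM
# U+05E8  Lo  HEBREW LETTER RESH
# U+200E  Cf  LEFT-TO-RIGHT MARK
```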
Michael Butscher
  • 10,028
  • 4
  • 24
  • 25
  • Nice way to distill the "real" characters. If I have other words after the Hebrew characters and I want to index them "correctly" (counting only the "real" characters), is there a way to do it? – Yo Hsiao Dec 18 '17 at 02:20
  • 1
    @YoHsiao I only see the way to iterate through the code points and look at each or first convert them using lemonheads approach to get the position of the filtered real characters and words. – Michael Butscher Dec 18 '17 at 02:24
0

Have you tried the io library?

>>> import io
>>> with io.open('text.txt', mode="r", encoding="utf-8") as f:
...     x = f.read()
>>> print(len(x))

You can also try codecs:

>>> import codecs
>>> with codecs.open('text.txt', 'r', 'utf-8') as f:
...     x = f.read()
>>> print(len(x))
Dawid Laszuk
  • 1,773
  • 21
  • 39
  • Thanks for the advice! But these two give identical results: 5. In fact, if you open it with an editor that decodes it correctly, moving the cursor will show that there are some underlying characters that modify "backward". In other words, clicking on "right" button moves the cursor back and forth, just not always forward. Backward modification is just like the unicode for accents. – Yo Hsiao Dec 18 '17 at 02:05
  • 1
    Using io and codecs is needed in Python 2 but generally not in Python 3. – Josh Lee Dec 18 '17 at 02:15
  • @JoshLee that's what I thought, as all the toy examples I made worked out of the box. Just thought I'd throw it out there. – Dawid Laszuk Dec 18 '17 at 02:32
0

Open the file with utf-8 encoding.

fin = open('filename','r',encoding='utf-8')

or

with open('filename','r',encoding='utf-8') as fin:
    for line1 in fin:
        print(len(line1.strip()))
Ahmad Yoosofan
  • 961
  • 12
  • 21