Replace Arabic text with Python

Question

I have a str that has Arabic characters in it

text = "صَوتُ صَفيرِ البُلْبُلِ"

I am trying to remove specific characters like ص. I tried

text.replace("ص", "")

but nothing worked. I searched and found some blogs saying that we need to write Arabic with English, but that is not practical.

When you try this, what output do you get, and what do you expect it to be? Removing ARABIC LETTER SAD in the text you've given will leave ARABIC FATHA that are incorrectly attached (most importantly, the string will *start* with a fatha). What do you expect to happen in that case? (When I try the above with Python 3.9, it "works" in that the ص are removed. What happens for you?) — Rob Napier, Mar 26 '22 at 17:53
The string stays the same. I want to remove all letters and keep only the tashkil; ص is only an example. Then I want to replace each tashkil with its ID. — abdelmoumen, Mar 26 '22 at 18:13
I can't reproduce this. How are you validating that the string does not change? (Are you aware that `.replace` returns a *new* str? It doesn't change the existing one. If you want to replace the existing one, you'd use `text = text.replace(...)`. Python strings are immutable. It's not clear from your code above, so it would be helpful if you provide your full test case, along with what you expect the result to be.) — Rob Napier, Mar 26 '22 at 20:55

jmd_dk · Answer 1 · 2022-03-27T13:37:07.573

I'm not familiar with Arabic text, but I do know that Arabic letters work differently than letters in Latin/English. Nearby letters somehow affect each other, which might be the source of confusion here.

Here's what happens when you carry out the replacement (in Python 3):

text = "صَوتُ صَفيرِ البُلْبُلِ"
text2 = text.replace("ص", "")
print(text)   # صَوتُ صَفيرِ البُلْبُلِ
print(text2)  # وتُ َفيرِ البُلْبُلِ

The comments above are copied from the printed output. However, if you paste them back in, the pasted output of text2 is in fact not identical to text2: something is lost in the printout (this is not the case for the original text). I imagine that text2 cannot be displayed correctly (i.e. in Arabic, some combinations of characters/symbols do not result in meaningful text).
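One way to see what the printout hides is to look at escaped code points instead of rendered glyphs. The following is a minimal sketch, assuming Python 3 and only the standard library; it uses the built-in ascii() and unicodedata.combining() to show that text2 begins with a combining mark that no longer has a base letter to sit on.

```python
import unicodedata

text = "صَوتُ صَفيرِ البُلْبُلِ"
text2 = text.replace("ص", "")

# ascii() escapes every non-ASCII character, so nothing here depends on
# how the terminal renders combining marks.
print(ascii(text[:2]))   # the letter followed by its mark
print(ascii(text2[:2]))  # starts directly with the mark

first = text2[0]
print(unicodedata.name(first))
# A nonzero canonical combining class identifies a combining character.
print(unicodedata.combining(first) != 0)
```

This matches the remark in the comments under the question that the resulting string starts with a fatha.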

Let's not rely on direct printout then, but instead consider each character at a time:

import itertools
import unicodedata
text = "صَوتُ صَفيرِ البُلْبُلِ"
text2 = text.replace("ص", "")
def compare_texts(text1, text2):
    # Print the Unicode names of both strings side by side, one character per row.
    for c1, c2 in itertools.zip_longest(text1, text2, fillvalue=""):
        # The shorter string is padded with "", which has no Unicode name.
        name1 = unicodedata.name(c1) if c1 else ''
        name2 = unicodedata.name(c2) if c2 else ''
        print(f"{name1:<20} : {name2:<20}")
compare_texts(text, text2)

From the output of the above, we see that text2 indeed is just text with ARABIC LETTER SAD ('ص') missing in two places.
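The same result can be checked without reading the table by eye. This is a small sketch, assuming text is the string from the question; it also illustrates the point raised in the comments that str.replace() returns a new string and leaves the original untouched.

```python
text = "صَوتُ صَفيرِ البُلْبُلِ"
sad = "ص"

# Calling replace() without using its return value changes nothing,
# because Python strings are immutable.
text.replace(sad, "")
print(sad in text)   # True: the letter is still there

text2 = text.replace(sad, "")
print(sad in text2)  # False: the new string no longer contains it

# The new string is shorter by exactly the number of removed letters.
print(len(text) - len(text2) == text.count(sad))  # True

# To update the variable itself, rebind it to the result.
text = text.replace(sad, "")
print(sad in text)   # False
```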

In conclusion: str.replace() does what you want (or at least what you tell it to do), it just might not look like it in the (naïvely) printed output.
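For the goal mentioned in the comments under the question (dropping the letters and keeping only the tashkil), one possible approach is to filter by Unicode category rather than calling replace() once per letter. The helper name keep_marks is hypothetical, and the sketch assumes that the tashkil of interest are exactly the characters in the nonspacing-mark category Mn. That holds for fatha, damma, kasra and sukun, but may not match other input. Spaces are dropped as well.

```python
import unicodedata

def keep_marks(s):
    """Return only the nonspacing marks (category Mn) of s, in order."""
    # Assumption: "tashkil" is approximated by Unicode category Mn.
    return "".join(ch for ch in s if unicodedata.category(ch) == "Mn")

text = "صَوتُ صَفيرِ البُلْبُلِ"
marks = keep_marks(text)

# Print code points and names rather than the marks themselves, since
# bare combining marks do not render reliably.
for ch in marks:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

What "ID" means in the comment is not specified; the code point from ord(ch) is only one candidate.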

Bonus

Here's a short video describing how/why Arabic (and other non-Latin writing systems) are more complicated than the Latin one.
