Arabic text zero width joiners not working between elements

Question

I am trying to implement a "Smart Search" feature which highlights text matches in a div as a user types a keyword. The highlighting works by using a regular expression to match the keyword in the div and replace it with

<span class="highlight">keyword</span>
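A minimal sketch of that kind of replacement, assuming the div's content is plain text and the keyword should be matched literally. The names `escapeRegExp` and `highlight` are illustrative, not the actual application code:

```javascript
// Escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every case-insensitive occurrence of `keyword` in a highlight span.
// Assumption: `text` is plain text with no HTML markup of its own.
function highlight(text, keyword) {
  if (!keyword) return text; // an empty keyword would match at every position
  const pattern = new RegExp(escapeRegExp(keyword), "gi");
  return text.replace(pattern, (match) => '<span class="highlight">' + match + "</span>");
}

console.log(highlight("Smart Search", "sea"));
// Smart <span class="highlight">Sea</span>rch
```

Escaping the keyword matters because a user can type characters such as `.` or `(` that would otherwise be read as regex syntax.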

The application supports both English and Arabic text. English works just fine, but when highlighting Arabic, the span "breaks" the cursive connection between letters, so the word no longer renders as a single continuous word.

I'm trying to fix the issue by using three separate regular expressions and adding zero-width joiners (&zwj;) as appropriate in each case:

Match at the Beginning of a word

var startsWithRegex = new RegExp("((^|\\s)" + keyword + ")", "gi");

var newSpan = "$1&zwj;&zwj;";

Match in the Middle of a word (Note: There can be multiple middleOf matches in a single word)

var middleOfRegex = new RegExp("([^(^|\\s)])(" + keyword + ")([^($|\\s)])", "gi");

var newSpan = "&zwj;$1&zwj;&zwj;$2&zwj;&zwj;$3&zwj;";

Match at the End of a word

var endsWithRegex = new RegExp("(" + keyword + "($|\\s))", "gi");

var newSpan = "&zwj;&zwj;$1";
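
For comparison, the three cases could perhaps be folded into a single pass that inspects the letters on either side of each match. The following is only a sketch under stated assumptions: the joining model is simplified to the basic Arabic letters U+0620..U+064A, the names `connects` and `highlightWithJoiners` are hypothetical, and it does not treat lam+alif specially. It places one zero-width joiner on each side of a tag boundary, as the doubled &zwj; above does, and there is no guarantee that this renders correctly in WebKit.

```javascript
const ZWJ = "\u200D";
// Assumption: only basic Arabic letters are considered.
const ARABIC_LETTER = /^[\u0620-\u064A]$/;
// Letters that never connect to the letter that follows them
// (hamza, the alif forms, waw forms, teh marbuta, dal, thal, reh, zain).
const NO_FORWARD_JOIN = new Set("\u0621\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648");

// True when letter `a` would normally connect to the following letter `b`.
function connects(a, b) {
  if (a === undefined || b === undefined) return false;
  const aJoinsForward = ARABIC_LETTER.test(a) && !NO_FORWARD_JOIN.has(a);
  const bJoinsBackward = ARABIC_LETTER.test(b) && b !== "\u0621"; // hamza joins to nothing
  return aJoinsForward && bJoinsBackward;
}

function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap each match in a span, adding a ZWJ on both sides of a tag boundary
// only where the surrounding letters would have been connected.
function highlightWithJoiners(text, keyword) {
  if (!keyword) return text;
  const pattern = new RegExp(escapeRegExp(keyword), "gi");
  return text.replace(pattern, (match, offset, whole) => {
    const before = connects(whole[offset - 1], match[0]) ? ZWJ : "";
    const after = connects(match[match.length - 1], whole[offset + match.length]) ? ZWJ : "";
    return before + '<span class="highlight">' + before + match + after + "</span>" + after;
  });
}
```

Because the callback sees the offset of each match, start, middle, and end of a word fall out of the same check, and several matches inside one word are handled independently.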

Both startsWithRegex and endsWithRegex appear to work as expected, but middleOfRegex is not. For example:

للأبد

transforms into:

ل‍‍ل‍‍أ‍بد

when the keyword is:

ل

I've tried various other combinations of &zwj;, but nothing seems to work. Is this a limitation of WebKit? Is there another implementation I can use to get my desired result?

Thanks!

A few extra notes:

This only happens in WebKit-based browsers (Chrome specifically, in my case), and we cannot use an alternative. I believe this bug is the root cause of the issue: https://bugs.webkit.org/show_bug.cgi?id=6148
This question is an extension of these two Stack Overflow questions:

Inserting HTML tag in the middle of Arabic word breaks word connection (cursive)

Partially colored Arabic word in HTML

I've worked on this issue before and believe it is caused by the first webkit bug you linked to... which has been open for a whopping 10 years. The &zwj; is helpful, but I'm not sure it will get you all the way. – TimHayes, Jan 05 '16 at 20:17
Seems like an issue with the special lam+alif rendering. If I put a `&zwj;` between "lam" and "alif", it breaks viewing in multiple browsers. – Zso, Dec 28 '16 at 08:14

score 1 · Answer 1 · edited Jan 04 '16 at 20:41

Arabic is a special case because each letter takes a different form depending on its position in the word. I remember solving a similar problem using Unicode values: each form of a letter has its own Unicode value. You can find the Unicode table here:

https://en.wikipedia.org/wiki/Arabic_script_in_Unicode

You can get the Unicode value using

var code = $(selector).text().charCodeAt(0);

edited Jan 04 '16 at 20:41 · ρss
answered Jan 04 '16 at 18:27 · Majali

I was not able to get the desired result by using charCodeAt on the text. When comparing the Unicode values of different forms of a letter, I found that they are actually the same value. Here's a jsfiddle I wrote to demonstrate what I'm seeing: https://jsfiddle.net/avfpjnc7/ Is there something I'm missing? – Drew MacLaren Jan 05 '16 at 16:55
Drew: I don't think you are missing anything. I can't see how this answer would help you fix the problem. – TimHayes Jan 05 '16 at 19:51
Please try using unescape() and escape(); they might help. – Majali Jan 06 '16 at 00:17
unescape() and escape() have been deprecated since JavaScript 1.5. I tried encodeURI() and decodeURI(), but it's the same problem: same Unicode character, same encoding. – Drew MacLaren Jan 06 '16 at 18:51
I hope this helps. Since the regex does not match the different forms of the same letter and treats them as different letters, I would base the smart search on charCodeAt, because it returns the same Unicode value. For example, if I type “abc” in the search box, the script searches the database for records whose first three letters have the same Unicode values as “abc”. Another thing I would do, if possible, is switch off the highlight style in the search box and show only the suggested results as full phrases, without a span, in the drop-down menu. – Majali Jan 08 '16 at 10:19
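
The observation in these comments can be reproduced with a short sketch. Ordinary Arabic text stores each letter as one code point regardless of position, and the positional shapes are chosen by the renderer. Separate code points for each shape do exist in the Arabic Presentation Forms-B block, and `String.prototype.normalize("NFKC")` folds them back to the base letter. The helper name `toBaseLetters` is illustrative only:

```javascript
const word = "\u0644\u0644\u0623\u0628\u062F"; // للأبد

// The first lam renders in initial form and the second in medial form,
// yet both are stored as the same code point, U+0644.
console.log(word.charCodeAt(0).toString(16)); // 644
console.log(word.charCodeAt(1).toString(16)); // 644

// Presentation forms of lam: isolated, final, initial, medial.
const lamForms = ["\uFEDD", "\uFEDE", "\uFEDF", "\uFEE0"];

// Fold presentation forms (if a source happens to contain them)
// back to the base letters before searching.
function toBaseLetters(text) {
  return text.normalize("NFKC");
}

console.log(lamForms.every((form) => toBaseLetters(form) === "\u0644")); // true
```

This would explain why comparing charCodeAt values cannot tell the forms apart: the distinction is not present in the stored text.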

Zso · Answer 2 · 2016-12-28T09:21:00.617

I suggest not separating this ligature; instead, extend the <span> tag to enclose the entire lam+alif structure for highlighting.

According to http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237, ZWJ works as ZWJ+ZWNJ+ZWJ between ل(lam) and ا(alif). It should be rendered as a connected lam followed by a connected alif (ل‍‌‍ا), not like the required ligature (لا).

Seems to me most browsers/fonts adhere to this requirement.

My answer applies to other ligatures as well, if you use them in your application (non-required ones, e.g.: mim + mim).
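
A rough sketch of that suggestion, assuming the match is available as a `[start, end)` index range into the plain text. The function names are hypothetical, only the four common alif forms are checked, and combining marks between the lam and the alif are ignored:

```javascript
const LAM = "\u0644";
// Alif forms that take the required ligature with lam:
// alif madda, alif hamza above, alif hamza below, plain alif.
const ALIF_FORMS = new Set(["\u0622", "\u0623", "\u0625", "\u0627"]);

// Widen a non-empty range [start, end) so it never splits a lam+alif pair.
function widenForLamAlif(text, start, end) {
  if (start > 0 && ALIF_FORMS.has(text[start]) && text[start - 1] === LAM) {
    start -= 1; // the match begins on the alif: pull in the lam before it
  }
  if (end > start && end < text.length && text[end - 1] === LAM && ALIF_FORMS.has(text[end])) {
    end += 1; // the match ends on the lam: pull in the alif after it
  }
  return [start, end];
}

// Wrap the (possibly widened) range in the highlight span.
function highlightRange(text, start, end) {
  const [s, e] = widenForLamAlif(text, start, end);
  return text.slice(0, s) + '<span class="highlight">' + text.slice(s, e) + "</span>" + text.slice(e);
}
```

With the word للأبد and the keyword ل, the second lam (index 1) is followed by an alif with hamza, so the range 1..2 is widened to 1..3 and the span encloses both letters.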
