I am trying to implement a "Smart Search" feature which highlights text matches in a div as a user types a keyword. The highlighting works by using a regular expression to match the keyword in the div and replace it with

<span class="highlight">keyword</span>

The application supports both English and Arabic text. English works just fine, but when highlighting Arabic, the span breaks the joining between letters, so the word no longer renders as a single continuous word.

I'm trying to fix the issue by using three separate regular expressions and adding zero-width joiners as appropriate in each case:

  • Match at the Beginning of a word

    var startsWithRegex = new RegExp("((^|\\s)" + keyword + ")", "gi");

    var newSpan = "<span class='highlight'>$1&zwj;</span>&zwj;";

  • Match in the Middle of a word (Note: There can be multiple middleOf matches in a single word)

    var middleOfRegex = new RegExp("([^(^|\\s)])(" + keyword + ")([^($|\\s)])", "gi");

    var newSpan = "&zwj;$1&zwj;<span class='highlight'>&zwj;$2&zwj;</span>&zwj;$3&zwj;";

  • Match at the End of a word

    var endsWithRegex = new RegExp("(" + keyword + "($|\\s))", "gi");

    var newSpan = "&zwj;<span class='highlight'>&zwj;$1</span>";

Both startsWithRegex and endsWithRegex appear to work as expected, but middleOfRegex does not. For example:

للأبد

transforms into:

ل‍‍ل‍‍أ‍بد

when the keyword is:

ل

I've tried various other combinations of &zwj; but nothing seems to work. Is this a limitation of WebKit? Is there another implementation I can use to get my desired result?

Thanks!



A few extra notes:

  • I've worked on this issue before and believe it is caused by the first WebKit bug you linked to... which has been open for a whopping 10 years. The &zwj; is helpful, but I'm not sure it will get you all the way. – TimHayes Jan 05 '16 at 20:17
  • Seems like an issue with the special lam+alif rendering. If I put a `&zwj;` between "lam" and "alif", it breaks viewing in multiple browsers. – Zso Dec 28 '16 at 08:14

2 Answers


Arabic is a special case because each letter takes a different form depending on its position in the word. I remember solving a similar problem using the letters' Unicode values, since each form of a letter has its own Unicode value. You can find the Unicode table here:

https://en.wikipedia.org/wiki/Arabic_script_in_Unicode

You can get the Unicode value using

var code = $(selector).text().charCodeAt(0);
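
As a rough sketch of what this returns in practice (an illustration only; the names `word`, `codes`, `initialLam` and `folded` are made up for the example): text that is typed normally is usually stored as base letters from the main Arabic block, so both occurrences of lam in the question's example report the same value whatever their position in the word. The positional forms in the table are separate presentation-form code points, and they only show up if the text was actually encoded with them. If it was, `normalize("NFKC")` should fold them back to the base letter.

```javascript
// The word from the question, written with escapes: lam, lam, alif-hamza, ba, dal.
const word = "\u0644\u0644\u0623\u0628\u062F";

// Hex code point of every character in the word.
const codes = Array.from(word, ch =>
  ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
console.log(codes.join(" ")); // 0644 0644 0623 0628 062F

// U+FEDF is ARABIC LETTER LAM INITIAL FORM (Arabic Presentation Forms-B).
// Compatibility normalization maps it back to the base letter U+0644.
const initialLam = "\uFEDF";
const folded = initialLam.normalize("NFKC");
console.log(folded === "\u0644"); // true
```

So comparing `charCodeAt` values can tell two different letters apart, but on ordinary text it cannot tell which positional form the browser will draw.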
Majali
  • I was not able to get the desired result by using charCodeAt on the text. When comparing the Unicode values of different forms of a letter, I found that they are actually the same value. Here's a jsfiddle I wrote to demonstrate what I'm seeing: https://jsfiddle.net/avfpjnc7/ Is there something I'm missing? – Drew MacLaren Jan 05 '16 at 16:55
  • Drew: I don't think you are missing anything. I can't see how this answer would help you fix the problem. – TimHayes Jan 05 '16 at 19:51
  • Please try to use unescape(); and escape(); they might help. – Majali Jan 06 '16 at 00:17
  • unescape() and escape() have been deprecated since JavaScript 1.5. I tried encodeURI() and decodeURI(), but it's the same problem: same Unicode character, same encoding. – Drew MacLaren Jan 06 '16 at 18:51
  • I hope that helps, as long as the regular expression does not match the different forms of the same letter and treats them as different letters. I will base my smart search on charCodeAt because it returns the same Unicode value. For example, if I type "abc" in the search box, the script will search the database for records whose first three letters have the same Unicode values as "abc". Another thing I will do, if possible, is switch off the highlight style in the search box and show only the suggested results as full phrases, without a span, in the drop-down menu. – Majali Jan 08 '16 at 10:19

I suggest not separating this ligature, but instead extending the <span> tag to enclose the entire lam+alif structure when highlighting.

According to http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237, ZWJ works as ZWJ+ZWNJ+ZWJ between ل(lam) and ا(alif). It should be rendered as a connected lam followed by a connected alif (ل‍‌‍ا), not like the required ligature (لا).

Seems to me most browsers/fonts adhere to this requirement.

My answer applies to other ligatures as well, if you use them in your application (non-required ones, e.g.: mim + mim).
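
A minimal sketch of that idea, under some assumptions of mine: the helper names (`escapeRegExp`, `highlightRanges`, `highlight`) are hypothetical, the set of alif variants is my guess at what to cover for the required lam+alif ligature, and the input is treated as plain text that has already been HTML-escaped. Each match is widened so a lam is never cut off from a following alif, and touching ranges are merged into one span.

```javascript
// Lam, and the alif variants assumed here to ligate with a preceding lam:
// alif-madda, alif-hamza-above, alif-hamza-below, plain alif.
const LAM = "\u0644";
const ALIFS = new Set(["\u0622", "\u0623", "\u0625", "\u0627"]);

// Escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns sorted, non-overlapping [start, end) ranges to highlight.
function highlightRanges(text, keyword) {
  if (!keyword) return [];
  const re = new RegExp(escapeRegExp(keyword), "gi");
  const ranges = [];
  for (const m of text.matchAll(re)) {
    let start = m.index;
    let end = start + m[0].length;
    // Match ends on a lam that is followed by an alif: include the alif.
    if (text[end - 1] === LAM && ALIFS.has(text[end])) end += 1;
    // Match starts on an alif that follows a lam: include the lam.
    if (start > 0 && ALIFS.has(text[start]) && text[start - 1] === LAM) start -= 1;
    // Merge with the previous range when they overlap or touch.
    const last = ranges[ranges.length - 1];
    if (last && start <= last[1]) {
      last[1] = Math.max(last[1], end);
    } else {
      ranges.push([start, end]);
    }
  }
  return ranges;
}

// Wraps every range in the highlight span used in the question.
function highlight(text, keyword) {
  let out = "";
  let pos = 0;
  for (const [start, end] of highlightRanges(text, keyword)) {
    out += text.slice(pos, start) +
      "<span class='highlight'>" + text.slice(start, end) + "</span>";
    pos = end;
  }
  return out + text.slice(pos);
}
```

With the example from the question (للأبد and the keyword ل), this wraps the first three letters in a single span, so the highlight covers one more letter than the keyword but the lam+alif pair stays inside one element. It does not address joining at the outer edges of the span, which is where the zero-width joiners from the question would still be needed.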

Zso