Using RegEx to highlight Arabic text

Question

My database contains Arabic text with diacritics (tashkeel). Users type their search terms without diacritics, and I can find the matching rows using full-text search statements, but I am unable to highlight the search term in the result using regular expressions:

$str="اِنَّ الَّذِیۡنَ اٰمَنُوۡا وَ عَمِلُوا الصّٰلِحٰتِ وَ اَخۡبَتُوۡۤا اِلٰی رَبِّہِمۡ ۙ اُولٰٓئِکَ اَصۡحٰبُ الۡجَنَّۃِ ۚ ہُمۡ فِیۡہَا خٰلِدُوۡنَ";

$ptr="عملوا";

$result = preg_replace("/$ptr/", '<span style="background:yellow">' . $ptr . '</span>', $str);

echo $result;

Any ideas on how to resolve this?

I can't read Arabic but clearly the search pattern in `$ptr` isn't in `$str`. `عَمِلُوا` and `عملوا` are two completely different strings programmatically speaking. — accdias, May 10 '21 at 20:11
Is this a simple matter of writing the missing unicode flag on the expression? https://3v4l.org/jHv59 Run that demo and click on the eye icon on the right in the output area to see that the found string is correctly highlighted (once you actually have a match between the needle and the haystack). ...to answer my own question, no, the unicode flag is not needed if the needle _actually_ exists in `$str`. https://3v4l.org/lfRDp This is not a good [mcve]. — mickmackusa, May 11 '21 at 01:49
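
To make the point in the comments above concrete, here is a minimal sketch that compares the vocalized word from the verse with the bare search term. The variable names are hypothetical, and the strings are written as PHP 7+ code point escapes only so that the combining marks are visible.

```php
<?php
// Hypothetical demo values: the word as stored (with tashkeel) and as typed.
// Vocalized: ain, fatha, meem, kasra, lam, damma, waw, alef.
$vocalized = "\u{0639}\u{064E}\u{0645}\u{0650}\u{0644}\u{064F}\u{0648}\u{0627}";
// Bare: ain, meem, lam, waw, alef.
$bare = "\u{0639}\u{0645}\u{0644}\u{0648}\u{0627}";

// Every code point here takes 2 bytes in UTF-8.
var_dump(strlen($vocalized)); // int(16)
var_dump(strlen($bare));      // int(10)

// The bare term is not a substring of the vocalized word, because a mark
// sits between the letters. The u flag does not change that.
var_dump(strpos($vocalized, $bare));                  // bool(false)
var_dump(preg_match('/' . $bare . '/u', $vocalized)); // int(0)
```

So the pattern fails because the needle's byte sequence never occurs in the haystack, not because of a missing regex flag.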

Artier · Accepted Answer · 2021-05-11T05:47:10.703

Your string contains extra characters (tashkeel), but the term you want to match has none. The solution is to strip those extra characters so that both strings are comparable:

<?php
function stripDiacritics($str) {
    $diacritic = array("ِ" ,"ٰ" ,"ّ" ,"ۡ" ,"ٖ" ,"ٗ" ,"ؘ" ,"ؙ" ,"ؚ" ,"ٍ" ,"َ" ,"ُ", "ٓ" ,"ْ" , "ٌ" , "ٍ",  "ً",  "ّ", "ۤ");
    return str_replace($diacritic, '', $str);
}

$str="اِنَّ الَّذِیۡنَ اٰمَنُوۡا وَ عَمِلُوا الصّٰلِحٰتِ وَ اَخۡبَتُوۡۤا اِلٰی رَبِّہِمۡ ۙ اُولٰٓئِکَ اَصۡحٰبُ الۡجَنَّۃِ ۚ ہُمۡ فِیۡہَا خٰلِدُوۡنَ";
// Search term, typed without tashkeel.
$ptr = "عملوا";
$words = explode(" ", $str);
$highlighted = array();
foreach ($words as $word) {
    // Compare the stripped word to the search term, but output the original word.
    if (stripDiacritics($word) === $ptr) {
        $highlighted[] = '<span style="background:yellow">' . $word . '</span>';
    } else {
        $highlighted[] = $word;
    }
}
echo implode(' ', $highlighted);

score -1 · Answer 2 · answered May 10 '21 at 23:58

While @Artier's answer might work acceptably, it's not the greatest idea to have loose UTF-8 combining marks in the source code and, from the bit I've gleaned from Google, the list may not cover the entire range of Arabic diacritics/combining marks.

Disclaimer: I know very little about Arabic, but I am very fussy about UTF-8.

@Artier's answer seems to have been culled from the accepted answer on this question, but the accepted answer is frequently not the optimal solution. One of these other two options from the same set of answers is likely closer to being canonically correct.

function strip_arabic_diacritics_1($str) {
    return preg_replace("~[\x{064B}-\x{065B}]~u", "", $str);
}

function strip_arabic_diacritics_2($str) {
    $ranges = [
        "~[\x{0600}-\x{061F}]~u",   
        "~[\x{063B}-\x{063F}]~u",   
        "~[\x{064B}-\x{065E}]~u",   
        "~[\x{066A}-\x{06FF}]~u",   
    ];

    return preg_replace($ranges, "", $str);
}

$str="اِنَّ الَّذِیۡنَ اٰمَنُوۡا وَ عَمِلُوا الصّٰلِحٰتِ وَ اَخۡبَتُوۡۤا اِلٰی رَبِّہِمۡ ۙ اُولٰٓئِکَ اَصۡحٰبُ الۡجَنَّۃِ ۚ ہُمۡ فِیۡہَا خٰلِدُوۡنَ";

$ptr="عملوا";

var_dump(
    $str,
    strip_arabic_diacritics_1($str),
    strip_arabic_diacritics_2($str)
);

Output:

string(265) "اِنَّ الَّذِیۡنَ اٰمَنُوۡا وَ عَمِلُوا الصّٰلِحٰتِ وَ اَخۡبَتُوۡۤا اِلٰی رَبِّہِمۡ ۙ اُولٰٓئِکَ اَصۡحٰبُ الۡجَنَّۃِ ۚ ہُمۡ فِیۡہَا خٰلِدُوۡنَ"
string(183) "ان الذیۡن اٰمنوۡا و عملوا الصٰلحٰت و اخۡبتوۡۤا الٰی ربہمۡ ۙ اولٰئک اصۡحٰب الۡجنۃ ۚ ہمۡ فیۡہا خٰلدوۡن"
string(127) "ان الذن امنوا و عملوا الصلحت و اخبتوا ال ربم  اولئ اصحب الجن  م فا خلدون"

As well, relying on explode() for word splitting is generally not feasible for human-written text as it will not respect punctuation or other non-space word breaks. This is the exact use case for IntlBreakIterator:

function strip_arabic_diacritics($str) {
    return strip_arabic_diacritics_2($str);
}

$br = IntlBreakIterator::createWordInstance();
$br->setText($str);

$output = '';
$ptr_stripped = strip_arabic_diacritics($ptr);

foreach($br->getPartsIterator() as $word) {
    $word_stripped = strip_arabic_diacritics($word);
    if( $ptr_stripped == $word_stripped ) {
        $output .= sprintf('<span class="...">%s</span>', $word);
    } else {
        $output .= $word;
    }
}

var_dump( $output );

Output:

string(290) "اِنَّ الَّذِیۡنَ اٰمَنُوۡا وَ <span class="...">عَمِلُوا</span> الصّٰلِحٰتِ وَ اَخۡبَتُوۡۤا اِلٰی رَبِّہِمۡ ۙ اُولٰٓئِکَ اَصۡحٰبُ الۡجَنَّۃِ ۚ ہُمۡ فِیۡہَا خٰلِدُوۡنَ"

The source string looks a bit wonky because of the switches between RTL and LTR, but it should render properly.
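
If you would rather stay with a regular expression, as the question title asks, one possible approach is to build a pattern from the search term that tolerates combining marks after each letter. This is only a sketch: `diacritic_tolerant_pattern()` and `highlight()` are hypothetical helpers, it assumes everything to be ignored falls in the Unicode category Mn (non-spacing mark), it matches inside longer words as well, and it assumes the haystack is already safe to emit as HTML and the needle is not empty.

```php
<?php
/**
 * Build a regex that matches $needle even when combining marks are
 * interleaved between its letters. Hypothetical helper for illustration.
 */
function diacritic_tolerant_pattern(string $needle): string {
    // Drop any marks the user typed, then split into individual code points.
    $bare    = preg_replace('/\p{Mn}+/u', '', $needle);
    $letters = preg_split('//u', $bare, -1, PREG_SPLIT_NO_EMPTY);

    // Allow zero or more non-spacing marks after every letter.
    $parts = array_map(
        function ($ch) { return preg_quote($ch, '~') . '\p{Mn}*'; },
        $letters
    );
    return '~' . implode('', $parts) . '~u';
}

function highlight(string $haystack, string $needle): string {
    // $0 re-inserts the matched text, so the original tashkeel is preserved.
    return preg_replace(
        diacritic_tolerant_pattern($needle),
        '<span style="background:yellow">$0</span>',
        $haystack
    );
}

// Two vocalized words from the verse, written as code point escapes.
$haystack = "\u{0648}\u{064E} \u{0639}\u{064E}\u{0645}\u{0650}\u{0644}\u{064F}\u{0648}\u{0627}";
// The bare search term (ain, meem, lam, waw, alef).
$needle = "\u{0639}\u{0645}\u{0644}\u{0648}\u{0627}";

echo highlight($haystack, $needle), "\n";
```

Because the replacement uses `$0` rather than the search term, the highlighted text keeps its diacritics, which the original `preg_replace()` call in the question would have lost.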

Hey, `ان الذن امنوا و عملوا الصلحت و اخبتوا ال ربم اولئ اصحب الجن م فا خلدون ` is completely different text. strip_arabic_diacritics_2 does not produce correct output: it's not just skipping diacritics, it's dropping letters. — Artier, May 11 '21 at 00:57
Finally found [a decent table of the Arabic plane](https://en.wikipedia.org/wiki/Arabic_(Unicode_block)) and wow, yeah `fn_2()` covers a _lot_ of stuff that doesn't look like diacritics. 066A-06FF? That's 2/3rds of the plane! If you want to suggest a sane set of ranges I'd be happy to amend my answer to reflect that. — Sammitch, May 11 '21 at 01:04
[Even better breakdown straight from the unicode consortium.](https://www.unicode.org/charts/PDF/U0600.pdf) It looks like 064B-065F covers the combining marks in the plane, and then 06D6-06ED seem to be combining marks as well, but are primarily used for Quranic annotation. More annotations... honorifics, subtending marks, _super_tending marks... I am so lost. — Sammitch, May 11 '21 at 01:14
skip these characters `"\u0650" ,"\u0670" ,"\u0651" ,"\u06E1" ,"\u0656" ,"\u0657" ,"\u0618" ,"\u0619" ,"\u061A" ,"\u064D" ,"\u064E" ,"\u064F", "\u0653" ,"\u0652" , "\u064c" , "\u064d", "\u064b", "\u0651"` — Artier, May 11 '21 at 01:33
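
As an alternative to maintaining ranges or character lists by hand, PCRE's Unicode property classes can decide what counts as a mark. The following is a sketch, not a vetted solution: `strip_marks()` is a hypothetical name, and it assumes that every non-spacing mark (category Mn) in the input can be dropped for search purposes, which may be too aggressive for some texts.

```php
<?php
// Sketch: strip non-spacing marks (Unicode category Mn) instead of listing ranges.
// Assumption: every Mn character in the input is safe to drop when searching.
function strip_marks(string $str): string {
    return preg_replace('/\p{Mn}+/u', '', $str);
}

// One word from the verse as code point escapes: it contains U+06E1
// (a Quranic sukun, outside 064B-065B) and the letter U+06C3 (inside 066A-06FF).
$word = "\u{0627}\u{0644}\u{06E1}\u{062C}\u{064E}\u{0646}\u{0651}\u{064E}\u{06C3}\u{0650}";

// Print the remaining code points in hex to show what survived.
foreach (preg_split('//u', strip_marks($word), -1, PREG_SPLIT_NO_EMPTY) as $ch) {
    printf("U+%04X ", mb_ord($ch, 'UTF-8'));
}
echo "\n";
```

On this word the marks U+06E1, U+064E, U+0651 and U+0650 are removed while the letter U+06C3 is kept, which is the behaviour neither of the two range-based functions in the answer gives.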
