16

How to remove Arabic diacritics dynamically

I'm designing an ebook in CHM format with multiple HTML pages that contain Arabic text. Sometimes the search engine won't highlight some Arabic words because of their diacritics. Is it possible to use JavaScript functions on page load to strip the diacritics from the Arabic text? There must be an option to enable them again, so I don't want to remove them from the HTML physically, only temporarily.

The thing is, I don't know where to start or which function to use.

Thank you :)

For example:

Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين 
hippietrail
Jomart Mirza

8 Answers

14

I wrote this function which handles strings with mixed Arabic and English characters, removing special characters (including diacritics) and normalizing some Arabic characters like converting all ة's into ه's.

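var normalize_text = function(text) {

  // Remove everything except Arabic letters, Arabic-Indic digits,
  // Latin letters, ASCII digits, and spaces (this also drops diacritics).
  text = text.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');

  // Normalize Arabic letter variants.
  text = text.replace(/(آ|إ|أ)/g, 'ا');
  text = text.replace(/(ة)/g, 'ه');
  text = text.replace(/(ئ|ؤ)/g, 'ء');
  text = text.replace(/(ى)/g, 'ي');

  // Convert Arabic-Indic digits (U+0660 to U+0669) to ASCII digits.
  var starter = 0x660;
  for (var i = 0; i < 10; i++) {
    // replace() returns a new string, so the result must be assigned back;
    // the global regex replaces every occurrence, not just the first.
    text = text.replace(new RegExp(String.fromCharCode(starter + i), 'g'), String.fromCharCode(48 + i));
  }

  return text;
};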
<input value="الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ" type="text" id="input">
<button onclick="document.getElementById('input').value = normalize_text(document.getElementById('input').value)">Normalize</button>
Rashad Saleh
  • Very nice... Just note that ى is not ي; it should be ا. It is called الألف المقصورة. – Khalid Almannai Jul 27 '23 at 08:09
  • When comparing texts for equality, it can be better to include false positives than to miss a true positive. For example, in a search feature, you might want to match على in the original text with the search term علي, in case the original text had a spelling mistake or was written in the Egyptian way, where the dots of ي are sometimes omitted. This is the original reason for the normalization part of the answer. – Rashad Saleh Jul 27 '23 at 09:58
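
The trade-off described in the comment above can be sketched as folding both the page text and the search term into the same loose form before comparing them. This is only a sketch, not the answer's code: `foldArabic` and `looseIncludes` are hypothetical names, and the set of folded characters is an assumption you would tune for your own text.

```javascript
// Hypothetical helper: fold a string into a loose form used only for comparison.
function foldArabic(text) {
  return text
    .replace(/[\u0640\u064B-\u0652\u0670]/g, '') // tatweel, basic tashkeel, superscript alif
    .replace(/[\u0622\u0623\u0625]/g, '\u0627')  // alif with madda or hamza -> bare alif
    .replace(/\u0629/g, '\u0647')                // ta marbuta -> ha
    .replace(/\u0649/g, '\u064A');               // alif maqsura -> ya (deliberately lossy)
}

// Hypothetical helper: true when the folded needle occurs in the folded haystack.
function looseIncludes(haystack, needle) {
  return foldArabic(haystack).indexOf(foldArabic(needle)) !== -1;
}

// A vocalized word ending in alif maqsura is found by a bare term ending in ya.
console.log(looseIncludes('\u0639\u064E\u0644\u064E\u0649', '\u0639\u0644\u064A')); // logs true
```

Because the folded form is used only for matching, the displayed text is never changed, so the lossy ى to ي step does not affect what the reader sees.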
9

Try this

Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين 

http://www.suhailkaleem.com/2009/08/26/remove-diacritics-from-arabic-text-quran/

The code is C#, not JavaScript, though. Still trying to figure out how to achieve this in JavaScript.

EDIT: Apparently it's very easy in JavaScript. The diacritics are stored as separate "letters" and can be removed quite easily.

var CHARCODE_SHADDA = 1617;
var CHARCODE_SUKOON = 1618;
var CHARCODE_SUPERSCRIPT_ALIF = 1648;
var CHARCODE_TATWEEL = 1600;
var CHARCODE_ALIF = 1575;

function isCharTashkeel(letter)
{
    if (typeof(letter) == "undefined" || letter == null)
        return false;

    var code = letter.charCodeAt(0);
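    // 1600 - tatweel (kashida), 1648 - superscript alif
    // 1611..1631 - U+064B..U+065F: tanween, fatha, damma, kasra, shadda, sukoon, madd (1619), ...
    // The lower bound is 1611 so that fathatan (U+064B) is stripped as well.
    return (code == CHARCODE_TATWEEL || code == CHARCODE_SUPERSCRIPT_ALIF || (code >= 1611 && code <= 1631)); //tashkeel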
}

function stripTashkeel(input)
{
  var output = "";
  //todo consider using a stringbuilder to improve performance
  for (var i = 0; i < input.length; i++)
  {
    var letter = input.charAt(i);
    if (!isCharTashkeel(letter)) //tashkeel
      output += letter;
  }

  return output;
}
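
The question also asks that the stripping be temporary. One way to sketch that is to remember each text node's original text once, then swap between the original and a stripped copy on demand. Everything here is an assumption for illustration: `rememberTextNodes`, `showDiacritics`, and `saved` are hypothetical names, the regex covers only U+064B to U+0652 plus tatweel and superscript alif, and the code sticks to old-style JavaScript on the guess that a CHM viewer's embedded browser engine may be dated.

```javascript
// Assumed set of marks to hide: tatweel, U+064B to U+0652, superscript alif.
var TASHKEEL = /[\u0640\u064B-\u0652\u0670]/g;
var saved = []; // one { node, original } record per text node

// Walk the tree once and remember every text node with its original text.
function rememberTextNodes(node) {
  if (node.nodeType === 3) { // 3 means text node
    saved.push({ node: node, original: node.nodeValue });
    return;
  }
  var kids = node.childNodes || [];
  for (var i = 0; i < kids.length; i++) {
    rememberTextNodes(kids[i]);
  }
}

// visible === true restores the stored text; false shows a stripped copy.
function showDiacritics(visible) {
  for (var i = 0; i < saved.length; i++) {
    var rec = saved[i];
    rec.node.nodeValue = visible ? rec.original : rec.original.replace(TASHKEEL, '');
  }
}

// In a page this could run on load (guarded so the sketch also loads outside a browser).
if (typeof window !== 'undefined') {
  window.onload = function () {
    rememberTextNodes(document.body);
    showDiacritics(false);
  };
}
```

A button could then call `showDiacritics(true)` to bring the marks back. The HTML source is never edited; only the live text nodes change.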

Edit: Here is another way to do it, using BuckData: http://qurandev.github.com/

Advantages:

  • Buck uses less bandwidth.
  • In JavaScript, you can search through the entire Buck Quran text in one shot, which is intuitive compared to Arabic search.
  • Buck to Arabic and Arabic to Buck is a simple JS call. Play with a live sample here: http://jsfiddle.net/BrxJP/
  • You can strip out all vowels from Buck text in a few milliseconds. Why do this? You can search in JavaScript while ignoring the tashkeel differences (fathah, dammah, kasrah), which leads to more hits.
  • Regex plus Buck text can lead to awesome optimizations. All the searches can be run locally: http://qurandev.appspot.com
  • How was the data generated? With a one-to-one mapping using http://corpus.quran.com/java/buckwalter.jsp
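
To illustrate the "strip out all vowels" point, here is a rough sketch that assumes "Buck" means Buckwalter transliteration, as the corpus.quran.com link suggests. It does not use the BuckData library; `BUCKWALTER`, `toBuckwalter`, and `stripBuckwalterVowels` are hypothetical names, and the table is a small subset that would need extending for real text.

```javascript
// Partial Buckwalter table (assumed subset; unmapped characters pass through).
var BUCKWALTER = {
  '\u0627': 'A', // alif
  '\u0628': 'b', // ba
  '\u062D': 'H', // hha
  '\u062F': 'd', // dal
  '\u0631': 'r', // ra
  '\u0639': 'E', // ain
  '\u0644': 'l', // lam
  '\u0645': 'm', // mim
  '\u0646': 'n', // nun
  '\u0647': 'h', // ha
  '\u064A': 'y', // ya
  '\u064B': 'F', '\u064C': 'N', '\u064D': 'K', // tanween
  '\u064E': 'a', '\u064F': 'u', '\u0650': 'i', // fatha, damma, kasra
  '\u0651': '~', '\u0652': 'o', '\u0670': '`'  // shadda, sukun, superscript alif
};

// Transliterate character by character.
function toBuckwalter(arabic) {
  var out = '';
  for (var i = 0; i < arabic.length; i++) {
    var ch = arabic.charAt(i);
    out += BUCKWALTER.hasOwnProperty(ch) ? BUCKWALTER[ch] : ch;
  }
  return out;
}

// In Buckwalter text the diacritics are plain ASCII, so one small regex removes them.
function stripBuckwalterVowels(buck) {
  return buck.replace(/[FNKaui~o`]/g, '');
}

var word = '\u0627\u0644\u0652\u062D\u064E\u0645\u0652\u062F\u064F'; // first word of the question's example
console.log(toBuckwalter(word));                        // logs AloHamodu
console.log(stripBuckwalterVowels(toBuckwalter(word))); // logs AlHmd
```

Note that the vowel-stripping regex is only safe on text that is entirely Buckwalter; it would also eat ordinary Latin vowels in mixed content.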

Sameer Alibhai
4

Here is some JavaScript code that removes Arabic diacritics in nearly all cases.

var arabicNormChar = {
    'ك': 'ک', 'ﻷ': 'لا', 'ؤ': 'و', 'ى': 'ی', 'ي': 'ی', 'ئ': 'ی', 'أ': 'ا', 'إ': 'ا', 'آ': 'ا', 'ٱ': 'ا', 'ٳ': 'ا', 'ة': 'ه', 'ء': '', 'ِ': '', 'ْ': '', 'ُ': '', 'َ': '', 'ّ': '', 'ٍ': '', 'ً': '', 'ٌ': '', 'ٓ': '', 'ٰ': '', 'ٔ': '', '�': ''
}

var simplifyArabic  = function (str) {
    return str.replace(/[^\u0000-\u007E]/g, function(a){ 
        var retval = arabicNormChar[a]
        if (retval == undefined) {retval = a}
        return retval; 
    }).normalize('NFKD').toLowerCase();
}

//now you can use simplifyArabic(str) on Arabic strings to remove the diacritics

Note: you may adjust the entries in arabicNormChar to suit your own preferences.

Sina Mansour L.
2

Use this regex to catch all tashkeel

[ؐ-ًؚٟ]
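
Character classes that contain literal combining marks are easy to corrupt when copied, because editors and Unicode normalization can reorder the marks. A sketch of one way around that is to dump the code points of the literal and rewrite the class with `\u` escapes. `toUnicodeEscapes` and `stripTashkeelEscaped` are hypothetical names, and the ranges used (U+0610 to U+061A and U+064B to U+065F) are my assumption about what the class above intends, not something the answer states.

```javascript
// Hypothetical helper: show every UTF-16 unit of a string as a \uXXXX escape.
function toUnicodeEscapes(s) {
  var out = '';
  for (var i = 0; i < s.length; i++) {
    var hex = s.charCodeAt(i).toString(16).toUpperCase();
    out += '\\u' + ('0000' + hex).slice(-4);
  }
  return out;
}

console.log(toUnicodeEscapes('\u064B-\u0652')); // logs \u064B\u002D\u0652

// Assumed ranges, written with escapes so that copying cannot reorder them.
var TASHKEEL_ESCAPED = /[\u0610-\u061A\u064B-\u065F]/g;

function stripTashkeelEscaped(text) {
  return text.replace(TASHKEEL_ESCAPED, '');
}
```

Pasting a suspicious literal into `toUnicodeEscapes` shows exactly which code points and hyphens survived the copy.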

Yusuf
1

I tried the following solution and it works fine:

const str = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ';
const withoutDiacs = str.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');
console.log(withoutDiacs); //الحمد لله رب العالمين
Reference: https://www.overdoe.com/javascript/2020/06/18/arabic-diacritics.html
Ahmed Ismail
  • I used this regex in C# and it is not correct, because it removes `ی` from sentences. For example, for the ayah 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِیمِ' it returned `بسم الله الرحمن الرحم`. – Sayed Abolfazl Fatemi Aug 11 '21 at 10:48
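
The comment above shows the cost of an allow-list: any letter outside the listed ranges, such as Persian yeh (U+06CC), is deleted along with the diacritics. A sketch of the opposite approach is to delete only nonspacing combining marks and leave every letter alone. It assumes an engine with ES2018 Unicode property escapes (`\p{...}` with the `u` flag), which older engines, possibly including the one embedded in a CHM viewer, do not support. `stripMarks` is a hypothetical name.

```javascript
// Remove every nonspacing mark (Unicode general category Mn), in any script.
function stripMarks(text) {
  return text.replace(/\p{Mn}/gu, '');
}

// The last word of the comment's example, spelled with Persian yeh (U+06CC).
var word = '\u0627\u0644\u0631\u0651\u064E\u062D\u0650\u06CC\u0645\u0650';
console.log(stripMarks(word) === '\u0627\u0644\u0631\u062D\u06CC\u0645'); // logs true
```

Tatweel (U+0640) is a letter modifier rather than a mark, so this sketch leaves it in place; strip it separately if needed.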
0

This site has some routines for Javascript Unicode normalization which could be used to do what you're attempting. If nothing else it could provide a good starting point.

If you can preprocess the data, Python has good Unicode routines that make easy work of these sorts of transformations. This might be a good option if you can preprocess your CHM file to produce a separate index file which could then be merged into your CHM:

import unicodedata

def _strip(text):
    return ''.join([c for c in unicodedata.normalize('NFD', text) \
        if unicodedata.category(c) != 'Mn'])

composed = u'\xcd\xf1\u0163\u0115\u0155\u0148\u0101\u0163\u0129\u014d' \
    u'\u0146\u0105\u013c\u012d\u017e\u0119'

print(_strip(composed))
# prints: Internationalize
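
For comparison, roughly the same idea can be sketched in JavaScript with `String.prototype.normalize` and a Unicode property escape. This is a hypothetical port (`stripAfterNFD` is not from the answer) and assumes an ES2018-capable engine. One thing to be aware of with Arabic: NFD decomposes letters such as أ into a bare letter plus a combining hamza, so this approach also folds those letters, which you may or may not want.

```javascript
// Decompose to NFD, then drop every nonspacing mark (category Mn).
function stripAfterNFD(text) {
  return text.normalize('NFD').replace(/\p{Mn}/gu, '');
}

// Same sample string as the Python example.
var composed = '\u00CD\u00F1\u0163\u0115\u0155\u0148\u0101\u0163\u0129\u014D' +
  '\u0146\u0105\u013C\u012D\u017E\u0119';
console.log(stripAfterNFD(composed)); // logs Internationalize

// Alif with hamza above (U+0623) becomes bare alif (U+0627).
console.log(stripAfterNFD('\u0623') === '\u0627'); // logs true
```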
samplebias
0

A shorter approach to remove the Arabic diacritics (either the 8 Basic diacritics or the full 52 diacritics) could be as follows:

Remove Basic Diacritics

function removeTashkeelBasic(s) {return s.replace(/[ً-ْ]/g,'');}



//===================
//     Test Cases
//===================
console.log(removeTashkeelBasic('حِسَابٌ وَحِسَابًا مِنْ ثَلَاثُمِئَةِ رِيَالٍ قَطَرِيٍّ'));
console.log(removeTashkeelBasic('بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ'));

Remove All Arabic Diacritics

function removeTashkeelAll(s) {return s.replace(/[ؐ-ًؕ-ٖٓ-ٟۖ-ٰٰۭ]/g,'');}


//===================
//     Test Cases
//===================
console.log(removeTashkeelAll('حِسَابٌ وَحِسَابًا مِنْ ثَلَاثُمِئَةِ رِيَالٍ قَطَرِيٍّ'));
console.log(removeTashkeelAll('بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ'));
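
Since literal combining marks inside a character class are hard to read and easy to damage when copied, here is a sketch of the "all diacritics" variant written with `\u` escapes. The ranges are my assumption of which 52 marks are meant (the combining marks in the Arabic block, U+0600 to U+06FF); `ARABIC_MARKS`, `countMatchesInArabicBlock`, and `removeTashkeelEscaped` are hypothetical names. The counting helper is there only to check that the class really covers 52 code points.

```javascript
// Assumed ranges for the combining marks in the Arabic block.
var ARABIC_MARKS = /[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED]/;

// Count how many code points in U+0600 to U+06FF the pattern matches.
function countMatchesInArabicBlock(re) {
  var count = 0;
  for (var code = 0x0600; code <= 0x06FF; code++) {
    if (re.test(String.fromCharCode(code))) count++;
  }
  return count;
}

console.log(countMatchesInArabicBlock(ARABIC_MARKS)); // logs 52

// Same pattern with the global flag, used for the actual stripping.
function removeTashkeelEscaped(s) {
  return s.replace(new RegExp(ARABIC_MARKS.source, 'g'), '');
}
```

Alif wasla (U+0671), which appears in the second test string above, is a letter rather than a mark, so it is deliberately not in the class.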
Mohsen Alyafei
0

Here is another approach based on the Arabic Unicode block:

const map = {
  'آ': 'ا',
  'أ': 'ا',
  'إ': 'ا',
  'ا': 'ا',
  'ٱ': 'ا',
  'ٲ': 'ا',
  'ٳ': 'ا',
  'ؤ': 'و',
  'ئ': 'ى',
  'ؽ': 'ؽ',
  'ؾ': 'ؾ',
  'ؿ': 'ؿ',
  'ي': 'ى',
  'ب': 'ب',
  'ت': 'ت',
  'ؠ': 'ؠ',
  'ة': 'ه',
  'ث': 'ث',
  'ج': 'ج',
  'ح': 'ح',
  'خ': 'خ',
  'د': 'د',
  'ذ': 'ذ',
  'ر': 'ر',
  'ز': 'ز',
  'س': 'س',
  'ش': 'ش',
  'ص': 'ص',
  'ض': 'ض',
  'ط': 'ط',
  'ظ': 'ظ',
  'ع': 'ع',
  'غ': 'غ',
  'ػ': 'ک',
  'ؼ': 'ک',
  'ف': 'ف',
  'ق': 'ق',
  'ك': 'ك',
  'ګ': 'ك',
  'ڬ': 'ك',
  'ڭ': 'ڭ',
  'ڮ': 'ك',
  'ل': 'ل',
  'م': 'م',
  'ن': 'ن',
  'ه': 'ه',
  'و': 'و',
  'ى': 'ى',
  'ٸ': 'ى',
  'ٵ': 'ءا', // hamza alef?
  'ٶ': 'ءو', // hamza waw?
  'ٹ': 'ٹ',
  'ٺ': 'ٺ',
  'ٻ': 'ٻ',
  'ټ': 'ت',
  'ٽ': 'ت',
  'پ': 'پ',
  'ٿ': 'ٿ',
  'ڀ': 'ڀ',
  'ځ': 'ءح',
  'ڂ': 'ح',
  'ڃ': 'ڃ',
  'ڄ': 'ڄ',
  'څ': 'ح',
  'چ': 'چ',
  'ڇ': 'ڇ',
  'ڈ': 'ڈ',
  'ډ': 'د',
  'ڊ': 'د',
  'ڋ': 'د',
  'ڌ': 'ڌ',
  'ڍ': 'ڍ',
  'ڎ': 'ڎ',
  'ڏ': 'د',
  'ڐ': 'د',
  'ڑ': 'ڑ',
  'ڒ': 'ر',
  'ړ': 'ر',
  'ڔ': 'ر',
  'ڕ': 'ر',
  'ږ': 'ر',
  'ڗ': 'ر',
  'ژ': 'ژ',
  'ڙ': 'ڙ',
  'ښ': 'س',
  'ڛ': 'س',
  'ڜ': 'س',
  'ڝ': 'ص',
  'ڞ': 'ص',
  'ڟ': 'ط',
  'ڠ': 'ع',
  'ڡ': 'ڡ',
  'ڢ': 'ڡ',
  'ڣ': 'ڡ',
  'ڤ': 'ڤ',
  'ڥ': 'ڡ',
  'ڦ': 'ڦ',
  'ڧ': 'ق',
  'ڨ': 'ق',
  'ک': 'ک',
  'ڪ': 'ڪ',
  'گ': 'گ',
  'ڰ': 'گ',
  'ڱ': 'ڱ',
  'ڲ': 'گ',
  'ڳ': 'ڳ',
  'ڴ': 'گ',
  'ڵ': 'ل',
  'ڶ': 'ل',
  'ڷ': 'ل',
  'ڸ': 'ل',
  'ڹ': 'ن',
  'ں': 'ں',
  'ڻ': 'ڻ',
  'ڼ': 'ن',
  'ڽ': 'ن',
  'ھ': 'ه',
  'ڿ': 'چ',
  'ۀ': 'ه',
  'ہ': 'ہ',
  'ۂ': 'ءہ',
  'ۃ': 'ہ',
  'ۄ': 'و',
  'ۅ': 'ۅ',
  'ۆ': 'ۆ',
  'ۇ': 'ۇ',
  'ۈ': 'ۈ',
  'ۉ': 'ۉ',
  'ۊ': 'و',
  'ۋ': 'ۋ',
  'ی': 'ی',
  'ۍ': 'ي',
  'ێ': 'ي',
  'ۏ': 'و',
  'ې': 'ې',
  'ۑ': 'ي',
  'ے': 'ے',
  'ۓ': 'ے',
  'ە': 'ە',
  'ۺ': 'ش',
  'ۻ': 'ض',
  'ۼ': 'ۼ',
  'ۿ': 'ه'
}

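function removeDiacritics(text) {
  // Spreading iterates by code point, so each letter or mark is one symbol.
  const symbols = [...text]
  const result = []
  for (const symbol of symbols) {
    // Emit the mapped base form. Anything absent from the map is dropped:
    // diacritics, but also spaces, digits, and punctuation.
    if (map[symbol]) {
      result.push(map[symbol])
    }
  }
  return result.join('')
}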

Some letters could still be considered to carry diacritics, such as ژ "jeh", which looks like ر "reh" with extra marks. Because it has its own fundamental name in Arabic, I did not strip its extra markings to turn it into "reh". I made the same call in a few other cases: ڤ and ڦ have fundamental names and are kept, while ڢ "dot below feh" and ڥ, which do not, are mapped to ڡ "feh". I'm not sure of the best way to approach those. I don't know the exact definition of what is and is not a diacritic, but this should be a good start.

Also, the "hamza + letter" ligatures were converted into hamza and the letter separately.

If you know how to improve this, please comment and add a fix if you'd like.

Lance