24

I was reading about Unicode on Wikipedia (the Arabic Unicode article) and I see that each of the Arabic digits has two Unicode code points. For example, 1 is encoded both as U+0661 and as U+06F1.

Which one should I use?

Jonathan Leffler
Karim

3 Answers

48

According to the code charts, U+0660 .. U+0669 are ARABIC-INDIC DIGIT values 0 through 9, while U+06F0 .. U+06F9 are EXTENDED ARABIC-INDIC DIGIT values 0 through 9.
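The names in the code charts can be checked against the Unicode character database. The following is a minimal sketch in Python using the standard `unicodedata` module; the helper name `describe_digits` is made up for illustration.

```python
import unicodedata


def describe_digits(base):
    """Return (code point, character, Unicode name, numeric value) for the
    ten digits starting at code point `base`."""
    rows = []
    for n in range(10):
        ch = chr(base + n)
        rows.append((f"U+{base + n:04X}", ch, unicodedata.name(ch), unicodedata.digit(ch)))
    return rows


if __name__ == "__main__":
    # 0x0660 starts the ARABIC-INDIC digits, 0x06F0 the EXTENDED ARABIC-INDIC digits.
    for base in (0x0660, 0x06F0):
        for code, ch, name, value in describe_digits(base):
            print(code, ch, name, value)
```

Both series carry the same numeric values 0 through 9; only the names (and, for some digits, the glyphs) differ.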

In the Unicode 3.0 book (5.2 is the current version, but these things don't change much once set), the U+066n series of glyphs are marked 'Arabic-Indic digits' and the U+06Fn series of glyphs are marked 'Eastern Arabic-Indic digits (Persian and Urdu)'. It also notes:

  • U+06F4 - 'different glyphs in Persian and Urdu'
  • U+06F5 - 'Persian and Urdu share glyph different from Arabic'
  • U+06F6 - 'Persian glyph different from Arabic'
  • U+06F7 - 'Urdu glyph different from Arabic'

For comparison:

  • U+066n: ٠١٢٣٤٥٦٧٨٩
  • U+06Fn: ۰۱۲۳۴۵۶۷۸۹

Or:

     U+066n    U+06Fn
0      ٠         ۰
1      ١         ۱
2      ٢         ۲
3      ٣         ۳
4      ٤         ۴
5      ٥         ۵
6      ٦         ۶
7      ٧         ۷
8      ٨         ۸
9      ٩         ۹

(Whether you can see any of those, and how clearly they are differentiated, may depend on your browser and the fonts installed on your machine as much as anything else. I can see the difference on 4 and 6 clearly; 5 looks much the same in both.)

Based on this information, if you are working with Arabic from the Middle East, use the U+066n series of digits; if you are working with Persian or Urdu, use the U+06Fn series of digits. A Unicode-aware application should accept either set of codes as valid digits (but you might look askance at a sequence that mixed the two sets of digits - or you might just leave well alone).
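As a rough illustration of accepting either set and spotting mixed input, here is a small Python sketch. The function names `digit_sets` and `to_ascii_digits` are hypothetical, and `to_ascii_digits` is deliberately broader than the question: it converts every Unicode decimal digit, not only the two Arabic series.

```python
import unicodedata

ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660 .. U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0 .. U+06F9


def digit_sets(text):
    """Return the set of digit families that occur in `text`."""
    found = set()
    for ch in text:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            found.add("arabic-indic")
        elif cp in EXTENDED_ARABIC_INDIC:
            found.add("extended-arabic-indic")
        elif "0" <= ch <= "9":
            found.add("ascii")
    return found


def to_ascii_digits(text):
    """Replace each Unicode decimal digit with its ASCII equivalent,
    leaving all other characters untouched."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )
```

A caller could treat `len(digit_sets(s)) > 1` as the "mixed sequence" case and decide whether to reject it or normalize it. Note that Python's built-in `int()` already accepts digits from either series.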

Jonathan Leffler
  • +1: would have made that answer if you hadn't beaten me by 1 minute ;-) It's a pity everybody seems to think that the difference doesn't matter and rush to make ill-advised answers... – Arthur Reutenauer Nov 04 '09 at 21:06
  • Roughly this means that they should never actually have had two code points. The fact that they look different in three different languages is a font issue and shouldn't have been encoded in the script. There are a few issues like that with Urdu characters encoded at different code points because they look different even though they are the same letter. Thus the correct solution would have been a font that displays the character differently depending on whether it's Arabic or Urdu. Joys of Unicode learning the Arabic script. – Dwayne Jan 14 '16 at 15:14
  • @Dwayne: yes, no, maybe. I can see that argument, but then why did they not retarget code points U+0030..U+0039 and say that in Arabic fonts, those should appear as the Arabic digits? There probably were good enough reasons for the decision; even if there weren't, that's the state of the standard. Note that the presentation forms of Arabic letters are now in supplements. The Arabic ranges are: Arabic in U+0600..U+06FF; Arabic Supplement in U+0750..U+077F; Arabic Extended-A in U+08A0..U+08FF; Arabic Presentation Forms-A in U+FB50..U+FDFF; and Arabic Presentation Forms-B in U+FE70..U+FEFF. – Jonathan Leffler Jan 14 '16 at 15:31
  • Yes, Urdu almost invariably uses the heart-shaped 5, and not the round one... Best would be to normalize in your own program (if you're parsing). Maybe there's a lib for that. But then again, who would write these characters with a keyboard is beyond me; maybe there's a mechanism for that. –  May 20 '16 at 06:48
4

In general you should not hard-code such info in your application.

  • On Windows, you can use GetLocaleInfo with LOCALE_SNATIVEDIGITS.
  • On Mac, use CFNumberFormatterCopyProperty with kCFNumberFormatterZeroSymbol.
  • Or use something like ICU.

There are Arabic countries that don't use the Arabic-Indic digits by default. So there is no direct mapping saying Arabic -> Arabic-Indic digits.

And the user might have changed the defaults in the Control Panel anyway.
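Once the platform has told you which digits the user wants, applying them is a simple substitution. The Python sketch below assumes you already hold a ten-character string of native digits for 0 through 9 (the kind of value LOCALE_SNATIVEDIGITS describes); it does not call any OS API itself, and `localize_digits` is a hypothetical helper name.

```python
ASCII_DIGITS = "0123456789"


def localize_digits(text, native_digits):
    """Replace ASCII digits in `text` with the corresponding characters from
    `native_digits`, a 10-character string ordered from 0 to 9.

    `native_digits` is assumed to come from the OS or a library such as ICU,
    not from a table hard-coded in the application.
    """
    if len(native_digits) != 10:
        raise ValueError("native_digits must contain exactly ten characters")
    return text.translate(str.maketrans(ASCII_DIGITS, native_digits))
```

If the user's settings report plain ASCII digits, the function returns the text unchanged, which covers the Arabic locales that do not default to Arabic-Indic digits.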

Jonathan Leffler
Mihai Nita
1

Which code do you prefer for representing the number 4, U+0664 or U+06F4?

(٤ or ۴ )?

To be consistent, let this choice guide which codes you use for 1, 2, and the other duplicate codes.

mob