Sorting Arabic words in Java

Question

I have a list of words in Arabic that I'd like to sort. I have tried the standard Collator with different Locales (like English or French, but without much hope) and I have even created my own RuleBasedCollator, but to no avail. Apparently the default sorting relies on the order of the Unicode code points, which works in many cases but apparently not in this one.

Following the javadocs, the RuleBasedCollator requires a string specifying the characters in the order you want them sorted. I created the following string, taking the Unicode code points from this table:

String arabicLetters = "< \u0623=\uFE83=\uFE84 < \u0628=\uFE8F=\uFE90=\uFE92=\uFE91 < \u062A=\uFE95=\uFE96=\uFE98=\uFE97 < \u062B=\uFE99=\uFE9A=\uFE9C=\uFE9B < \u062C=\uFE9D=\uFE9E=\uFEA0=\uFE9F < \u062D=\uFEA1=\uFEA2=\uFEA4=\uFEA3 < \u062E=\uFEA5=\uFEA6=\uFEA8=\uFEA7 < \u062F=\uFEA9=\uFEAA < \u0630=\uFEAB=\uFEAC < \u0631=\uFEAD=\uFEAE < \u0632=\uFEAF=\uFEB0 < \u0633=\uFEB1=\uFEB2=\uFEB4=\uFEB3 < \u0634=\uFEB5=\uFEB6=\uFEB8=\uFEB7 < \u0635=\uFEB9=\uFEBA=\uFEBC=\uFEBB < \u0636=\uFEBD=\uFEBE=\uFEC0=\uFEBF < \u0637=\uFEC1=\uFEC2=\uFEC4=\uFEC3 < \u0638=\uFEC5=\uFEC6=\uFEC8=\uFEC7 < \u0639=\uFEC9=\uFECA=\uFECC=\uFECB < \u063A=\uFECD=\uFECE=\uFED0=\uFECF < \u0641=\uFED1=\uFED2=\uFED4=\uFED3 < \u0642=\uFED5=\uFED6=\uFED8=\uFED7 < \u0643=\uFED9=\uFEDA=\uFEDC=\uFEDB < \u0644=\uFEDD=\uFEDE=\uFED0=\uFEDF < \u0645=\uFEE1=\uFEE2=\uFEE4=\uFEE3 < \u0646=\uFEE5=\uFEE6=\uFEE8=\uFEE7 < \u0647=\uFEE9=\uFEEA=\uFEEC=\uFEEB < \u0648=\uFEED=\uFEEE < \u064A=\uFEF1=\uFEF2=\uFEF4=\uFEF3 < \u0622=\uFE81=\uFE82 < \u0629=\uFE93=\uFE94 < \u0649=\uFEEF=\uFEF0 < \u0627";

Arabic letters can take up to four forms depending on their position in a word. Therefore, what I did in the rules string above is make all forms of each letter equal using '='. Then I indicate the order of the letters by separating them with '<'. I imagine that this is the correct way to do it.
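
To make the rule syntax concrete, here is a minimal sketch on a toy Latin alphabet rather than the Arabic rules above. The class name `RuleSyntaxSketch` and the rule string are made up for illustration; only `RuleBasedCollator` and its `'<'` and `'='` operators come from the javadocs.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class RuleSyntaxSketch {
    // Hypothetical toy rules: '<' orders c before b before a,
    // '=' declares the upper-case letter identical to the lower-case one.
    static final String RULES = "< c = C < b = B < a = A";

    public static RuleBasedCollator build() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            // Fail loudly instead of continuing with a null collator.
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = build();
        // Positive value: under these rules "a" is placed after "b".
        System.out.println(collator.compare("a", "b"));
        // Compares a letter with the form declared identical to it via '='.
        System.out.println(collator.compare("b", "B"));
    }
}
```

Wrapping the ParseException this way also avoids continuing with a null collator when the rules fail to parse.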

Now, if I have a collection with the days of the week (sorted in this case by day of the week, not 'alphabetically'):

الأَحَد, الاِثنَين, الثُّلاثاء, الأَربِعاء, الخَميس, الجُمعة, السَّبت

The results I am getting are not sorted at all:

الأَحَد, الخَميس, الاِثنَين, الثُّلاثاء, الأَربِعاء, السَّبت, الجُمعة

Besides, for such a small number of words it takes a considerable amount of time, which makes it unusable.

Does anybody know if I'm doing something wrong or if there is a life-saving library that already handles this?

I did some googling before writing this and I'm surprised I didn't find a single result.

Thanks!

UPDATE with code:

public static class TranslatableComparator implements java.util.Comparator<Translatable> {
        @Override
        public int compare(Translatable t1, Translatable t2) {

            String sortingRules = "< \u0623=\uFE83=\uFE84 < \u0628=\uFE8F=\uFE90=\uFE92=\uFE91 < \u062A=\uFE95=\uFE96=\uFE98=\uFE97 < \u062B=\uFE99=\uFE9A=\uFE9C=\uFE9B < \u062C=\uFE9D=\uFE9E=\uFEA0=\uFE9F < \u062D=\uFEA1=\uFEA2=\uFEA4=\uFEA3 < \u062E=\uFEA5=\uFEA6=\uFEA8=\uFEA7 < \u062F=\uFEA9=\uFEAA < \u0630=\uFEAB=\uFEAC < \u0631=\uFEAD=\uFEAE < \u0632=\uFEAF=\uFEB0 < \u0633=\uFEB1=\uFEB2=\uFEB4=\uFEB3 < \u0634=\uFEB5=\uFEB6=\uFEB8=\uFEB7 < \u0635=\uFEB9=\uFEBA=\uFEBC=\uFEBB < \u0636=\uFEBD=\uFEBE=\uFEC0=\uFEBF < \u0637=\uFEC1=\uFEC2=\uFEC4=\uFEC3 < \u0638=\uFEC5=\uFEC6=\uFEC8=\uFEC7 < \u0639=\uFEC9=\uFECA=\uFECC=\uFECB < \u063A=\uFECD=\uFECE=\uFED0=\uFECF < \u0641=\uFED1=\uFED2=\uFED4=\uFED3 < \u0642=\uFED5=\uFED6=\uFED8=\uFED7 < \u0643=\uFED9=\uFEDA=\uFEDC=\uFEDB < \u0644=\uFEDD=\uFEDE=\uFED0=\uFEDF < \u0645=\uFEE1=\uFEE2=\uFEE4=\uFEE3 < \u0646=\uFEE5=\uFEE6=\uFEE8=\uFEE7 < \u0647=\uFEE9=\uFEEA=\uFEEC=\uFEEB < \u0648=\uFEED=\uFEEE < \u064A=\uFEF1=\uFEF2=\uFEF4=\uFEF3 < \u0622=\uFE81=\uFE82 < \u0629=\uFE93=\uFE94 < \u0649=\uFEEF=\uFEF0 < \u0627";
            RuleBasedCollator col = null;
            try {
                col = new RuleBasedCollator(sortingRules);
            } catch (ParseException e) {
                //col = (RuleBasedCollator)RuleBasedCollator.getInstance(Locale.FRENCH);
            }

            return col.getCollationKey(t1.getTranslation().getText()).compareTo(col.getCollationKey(t2.getTranslation().getText()));
        }
    }

Can you post some more of the code, please? Just so we can see what's actually happening. — shaunvxc, Jun 05 '13 at 20:22
I'm not completely familiar with RuleBasedCollator, but what happens when you separate the characters whose values you want to be equal with commas? Something like this: "< a,A< b,B< c,C< d,D" — shaunvxc, Jun 05 '13 at 20:28
And no exceptions are being caught, correct? Never mind, clearly not, because you are still getting a result. I'll try to take a look into this. — shaunvxc, Jun 05 '13 at 20:33
Does ICU4J have specific rules for Arabic? It uses the Common Locale Data Repository. See http://site.icu-project.org/. — Eric Jablow, Jun 05 '13 at 20:44
are you using the CollationKeys properly? http://docs.oracle.com/javase/1.5.0/docs/api/java/text/CollationKey.html — shaunvxc, Jun 05 '13 at 20:45
One reason it's slow is that you're creating a new collator every time. Since a collator is stateless, you only need to create a single collator for a single set of rules, e.g. use a constant like `static final Collator c;` `static { try { c = new RuleBasedCollator(rules); } catch(ParseException e) { throw new RuntimeException(e); } }` Also, per the CollationKey javadoc, it's generally faster to use Collator.compare for doing comparisons one-at-a-time like this. — superEb, Jun 06 '13 at 02:01

rxg · Accepted Answer · 2013-06-06T11:08:05.507

You don't need to define your own collator; just use the built-in one for Arabic. Your Comparator then looks like this:

public int compare(Translatable t1, Translatable t2) {
        return Collator.getInstance(new Locale("ar")).compare(t1.getTranslation().getText(), t2.getTranslation().getText());
}

(You can check if a collator is available for Arabic by browsing the results from Collator.getAvailableLocales().)
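
As a rough, self-contained sketch of the same idea on plain strings (the `Translatable` class is not shown in the question, so it is left out here), the collator can be created once and reused for every comparison. The class name `ArabicSortSketch`, the method `sortedCopy`, and the sample words are assumptions for illustration; `Locale.forLanguageTag("ar")` is used as an equivalent of `new Locale("ar")`.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ArabicSortSketch {
    // Created once and shared, instead of once per comparison.
    private static final Locale ARABIC_LOCALE = Locale.forLanguageTag("ar");
    private static final Collator ARABIC = Collator.getInstance(ARABIC_LOCALE);

    // Returns a sorted copy so the caller's list is not modified.
    public static List<String> sortedCopy(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(ARABIC); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        // Reports whether the runtime lists an Arabic collation locale.
        boolean listed = Arrays.asList(Collator.getAvailableLocales()).contains(ARABIC_LOCALE);
        System.out.println("Arabic collator listed: " + listed);

        // Sample words written as escapes: "tamr", "bab", "ab" (illustrative only).
        List<String> words = Arrays.asList(
                "\u062A\u0645\u0631", "\u0628\u0627\u0628", "\u0623\u0628");
        System.out.println(sortedCopy(words));
    }
}
```

Sorting a copy also sidesteps surprises caused by the list being modified elsewhere before or after the sort.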

As noted in the comments, if you're worried about performance you should calculate the collation keys, store them in your Translatable objects and sort on them instead.
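
A minimal sketch of the collation-key approach, again on plain strings because `Translatable` is not shown: each key is computed once and the keys are sorted directly. The class `CollationKeySketch` and the method `sortWithKeys` are hypothetical names; in the real code the key would presumably be stored as a field of each `Translatable`.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CollationKeySketch {
    private static final Collator ARABIC =
            Collator.getInstance(Locale.forLanguageTag("ar"));

    public static List<String> sortWithKeys(List<String> words) {
        // Compute each key exactly once, up front.
        List<CollationKey> keys = new ArrayList<>(words.size());
        for (String word : words) {
            keys.add(ARABIC.getCollationKey(word));
        }
        // CollationKey implements Comparable<CollationKey>.
        Collections.sort(keys);

        // Recover the original strings in sorted order.
        List<String> sorted = new ArrayList<>(keys.size());
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("\u062A\u0645\u0631"); // "tamr" (illustrative sample)
        words.add("\u0628\u0627\u0628"); // "bab"
        System.out.println(sortWithKeys(words));
    }
}
```

Keys tend to pay off when the same strings are compared many times; for a one-off sort of a short list, calling Collator.compare directly is usually simpler.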

If you really want to see where the differences are between what you defined and the standard collator, just print out the rules:

System.out.println(((RuleBasedCollator) Collator.getInstance(new Locale("ar"))).getRules());

edited Jun 06 '13 at 11:08

answered Jun 06 '13 at 10:51

I fixed it. I was doing something wrong myself somewhere else in the code. Basically, I was modifying the List before sorting it with this method, and therefore I was getting the wrong results. As rxg mentions, there is no need to use a special Collator for this. The sorting will work taking the Unicode values of the characters, and since they are ordered alphabetically, that's it. About the performance, indeed I hadn't noticed that I was creating an object for every comparison. I changed it to what superEb suggested and now it takes less than a second. Thanks! – Gonan Jun 06 '13 at 11:23
Since this is what I was doing before trying the RuleBasedCollator approach, and basically because this is the way to do it, I'll mark it as the answer. – Gonan Jun 06 '13 at 11:30
