How to create a custom collator?

Question

I am using the following function as the comparison function to sort a list of strings:

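#include <algorithm>  // transform
#include <cctype>     // tolower
#include <locale>     // locale, collate, use_facet
#include <string>

using namespace std;

bool stringLessThan(const string& str1, const string& str2)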
{
   const collate<char>& col = use_facet<collate<char> >(locale()); // Use the global locale

   string s1(str1);
   string s2(str2);

   transform(s1.begin(), s1.end(), s1.begin(), ::tolower);
   transform(s2.begin(), s2.end(), s2.begin(), ::tolower);
   const char* pb1 = s1.data();
   const char* pb2 = s2.data();
   return (col.compare(pb1, pb1 + s1.size(), pb2, pb2 + s2.size()) < 0);
}

I am setting the global locale as:

locale::global(locale("pt_BR.UTF-8"));

If I use the en_US.UTF-8 locale, words with accents in my language (Brazilian Portuguese) are sorted in a different order than I want, so I use pt_BR.UTF-8. But then the string "as" comes before "a", and I want "a" first and then "as".

The reason is that the collator ignores spaces, so strings like:

a pencil
an apple

are treated as:

apencil
anapple

and, when sorted, appear in this order:

an apple
a pencil

but I want:

a pencil
an apple

I did this in Java, where the solution was to create a custom collator. How can I handle this in C++?

I don't have an answer for you now, but I do have a tip: you should not create temporary strings inside your comparison function if you care at all about performance. Construction and copying of these strings will be quite inefficient, and if you are using this comparison function e.g. for a std::map, the comparator will be called many times, compounding the inefficiency. — John Zwinck, Dec 06 '15 at 15:46
On my machine `en_US.utf8` and `pt_BR.utf8` seem to do exactly the same thing (and they ignore the case so your tolower transformation is redundant). — n. m. could be an AI, Dec 06 '15 at 17:40
@n.m. I agree that the transformation is redundant. But `en_US.UTF-8` and `pt_BR.UTF-8` differ when accents are involved. Try sorting `Terror`, `Suspense` and `Épico`: the results will be different. — ViniciusArruda, Dec 06 '15 at 17:51
They work identically for me (Épico, Suspense, Terror). I use `std::sort(v.begin(), v.end(), std::locale("en_US.utf8"));` and `std::sort(v.begin(), v.end(), std::locale("pt_BR.utf8"));`. They are different from say `std::sort(v.begin(), v.end(), std::locale("C"));` What is the order on your machine? — n. m. could be an AI, Dec 06 '15 at 17:55
For `pt_BR.UTF-8`: `Épico`, `Suspense` and `Terror`. For `en_US.UTF-8`: `Suspense`, `Terror` and `Épico`. I am passing the function `stringLessThan` to `sort` as parameter. — ViniciusArruda, Dec 06 '15 at 17:59
Your en_US.UTF-8 seems broken. What does the program [on this page](http://en.cppreference.com/w/cpp/locale/collate) print? — n. m. could be an AI, Dec 06 '15 at 18:04

score 2 · Accepted Answer · answered Dec 06 '15 at 17:15

Try creating your own collator class or comparison function. While in Java the more idiomatic approach might be to do this through extension, in C++, and for your case, I'd recommend using composition.

This simply means that your custom collator class would have a collator member that it would use to help it perform collation, as opposed to deriving from the collate class.

As for your rules for comparison, it seems that you will need to explicitly implement your own logic. If you don't want spaces to be ignored, perhaps you should tokenize your strings.
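As a rough illustration of the composition and tokenizing ideas above, here is a minimal sketch. The names `WordwiseCollator` and `sortedWordwise` are hypothetical (they are not part of the standard library), and the sketch assumes that words are separated by whitespace and that comparing word by word is the ordering you want. The comparator stores a `std::locale` as a member and asks its `std::collate<char>` facet to compare one word at a time, so a space always ends a word instead of being ignored.

```cpp
#include <algorithm>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): holds a locale by composition
// and compares two strings word by word using that locale's collate facet.
class WordwiseCollator {
public:
    explicit WordwiseCollator(const std::locale& loc = std::locale())
        : loc_(loc) {}

    // Returns true if lhs should be ordered before rhs.
    bool operator()(const std::string& lhs, const std::string& rhs) const {
        const std::collate<char>& col =
            std::use_facet<std::collate<char> >(loc_);
        std::istringstream left(lhs), right(rhs);
        std::string a, b;
        for (;;) {
            // operator>> skips whitespace and extracts the next word.
            const bool hasA = static_cast<bool>(left >> a);
            const bool hasB = static_cast<bool>(right >> b);
            // If either side ran out of words, the one with fewer words
            // sorts first; equal word lists are not "less than".
            if (!hasA || !hasB) return !hasA && hasB;
            const int r = col.compare(a.data(), a.data() + a.size(),
                                      b.data(), b.data() + b.size());
            if (r != 0) return r < 0;  // first differing word decides
        }
    }

private:
    std::locale loc_;  // composition: the collator "has a" locale
};

// Hypothetical convenience wrapper: returns a sorted copy of items.
std::vector<std::string> sortedWordwise(std::vector<std::string> items,
                                        const std::locale& loc) {
    std::sort(items.begin(), items.end(), WordwiseCollator(loc));
    return items;
}
```

With the locale from the question, usage could look like `std::sort(v.begin(), v.end(), WordwiseCollator(std::locale("pt_BR.UTF-8")));`. Note that the `std::locale` constructor throws `std::runtime_error` if that locale is not installed. Because the first words `a` and `an` are compared on their own, `a pencil` is ordered before `an apple`. This version creates temporary streams and strings on every comparison, so it trades speed for brevity.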
