I'm writing lexicography software, which may theoretically need to sort tens of thousands of strings with arbitrary (dictionary-project-specific) collations. There are two means of specifying custom collations:
- a map of graphemes to Unicode-style multi-level collation keys.
- an array of alphabetic graphemes (possibly including digraphs, etc.) in sorting order, which can be internally transformed into a map of collation keys (a rough sketch of that transformation is below).
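To make the second option concrete, here's a minimal sketch in Python (the names are just illustrative, and it only produces single-level keys):

```python
def collation_map_from_order(alphabet):
    """Map each grapheme to its rank in the project-specific sort order."""
    return {grapheme: rank for rank, grapheme in enumerate(alphabet)}

# e.g. a toy Spanish-style ordering where "ch" and "ll" sort as their own letters
keys = collation_map_from_order(
    ["a", "b", "c", "ch", "d", "e", "l", "ll", "m", "n", "ñ", "o"]
)
# keys["c"] == 2, keys["ch"] == 3, keys["d"] == 4, ...
```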
The naive method of comparing two strings is to walk them grapheme by grapheme until the first mismatch, then look up the collation keys for the mismatched graphemes and compare those, but I'm hoping there's a more efficient way of doing it.
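For reference, that naive comparator looks roughly like this, assuming each string has already been segmented into a list of graphemes (digraph handling lives in that segmentation step) and that `keys` is the single-level map from above:

```python
from functools import cmp_to_key

def compare_naive(a, b, keys):
    """C-style comparator over two grapheme lists: negative, zero, or positive."""
    for ga, gb in zip(a, b):
        if ga != gb:
            return keys[ga] - keys[gb]  # first mismatching grapheme decides
    return len(a) - len(b)              # otherwise the shorter string sorts first

# usage: words is a list of grapheme lists
# words.sort(key=cmp_to_key(lambda a, b: compare_naive(a, b, keys)))
```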
The best idea I've got so far depends on noticing that strings of equal length can be treated as big-endian base-n numbers (first grapheme as the most significant digit), so I can pre-compute an integer key for each string, which turns collation into cheap integer comparison. But this breaks for strings of different lengths (a big deal when sorting a dictionary), and there's no bound on the size of the integers that get generated. To account for length differences, I thought I could compute a list of keys for all prefixes of each string, and then compare the keys for the prefixes whose length equals the shorter of the two strings being compared. That seems to work pretty well, but the key sizes are still unbounded, and storing all the prefix keys could use a lot of memory.
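Here's roughly what I mean, again in Python with single-level keys and illustrative names (`base` is the alphabet size):

```python
def prefix_keys(graphemes, keys, base):
    """Integer key for every prefix of a grapheme list, most significant digit first."""
    out, acc = [], 0
    for g in graphemes:
        acc = acc * base + keys[g]  # append one base-n "digit"
        out.append(acc)
    return out

def compare_by_prefix_keys(pa, pb):
    """Compare two strings via their precomputed prefix-key lists."""
    n = min(len(pa), len(pb))
    if n == 0 or pa[n - 1] == pb[n - 1]:
        return len(pa) - len(pb)    # one string is a prefix of the other
    return -1 if pa[n - 1] < pb[n - 1] else 1
```

The comparison itself is cheap, but `acc` is an arbitrary-precision integer that grows with string length, which is where the unbounded key size and the memory cost come from.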
Is there a way to improve that approach? Or am I just going about it entirely wrong, and there's a much better means of sorting strings with arbitrary collations?