I'm writing lexicography software, which may theoretically need to sort tens of thousands of strings with arbitrary (dictionary-project-specific) collations. There are two means of specifying custom collations:
- a map of graphemes to Unicode-style multi-level collation keys.
- an array of alphabetic graphemes (possibly including digraphs, etc.) in sorting order, which can be internally transformed into a map of collation keys (a rough sketch of that transformation is below).
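To make the second option concrete, here's a minimal sketch in Python (the names are just illustrative, and it only produces single-level keys):

```python
def collation_map_from_order(alphabet):
    """Map each grapheme to its rank in the project-specific sort order."""
    return {grapheme: rank for rank, grapheme in enumerate(alphabet)}

# e.g. a toy Spanish-style ordering where "ch" and "ll" sort as their own letters
keys = collation_map_from_order(
    ["a", "b", "c", "ch", "d", "e", "l", "ll", "m", "n", "ñ", "o"]
)
# keys["c"] == 2, keys["ch"] == 3, keys["d"] == 4, ...
```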
The naive method of comparing two strings is to walk them grapheme by grapheme until the first mismatch, then look up the collation keys for the mismatched graphemes and compare those, but I'm hoping there's a more efficient way of doing it.
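For reference, that naive comparator looks roughly like this, assuming each string has already been segmented into a list of graphemes (digraph handling lives in that segmentation step) and that `keys` is the single-level map from above:

```python
from functools import cmp_to_key

def compare_naive(a, b, keys):
    """C-style comparator over two grapheme lists: negative, zero, or positive."""
    for ga, gb in zip(a, b):
        if ga != gb:
            return keys[ga] - keys[gb]  # first mismatching grapheme decides
    return len(a) - len(b)              # otherwise the shorter string sorts first

# usage: words is a list of grapheme lists
# words.sort(key=cmp_to_key(lambda a, b: compare_naive(a, b, keys)))
```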
The best idea I've got so far depends on noticing that strings of equal length can be treated as big-endian base-n numbers (first grapheme as the most significant digit), so I can pre-compute an integer key for each string, which turns collation into cheap integer comparison. But this breaks for strings of different lengths (a big deal when sorting a dictionary), and there's no bound on the size of the integers that get generated. To account for length differences, I thought I could compute a list of keys for all prefixes of each string, and then compare the keys for the prefixes whose length equals the shorter of the two strings being compared. That seems to work pretty well, but the key sizes are still unbounded, and storing all the prefix keys could use a lot of memory.
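Here's roughly what I mean, again in Python with single-level keys and illustrative names (`base` is the alphabet size):

```python
def prefix_keys(graphemes, keys, base):
    """Integer key for every prefix of a grapheme list, most significant digit first."""
    out, acc = [], 0
    for g in graphemes:
        acc = acc * base + keys[g]  # append one base-n "digit"
        out.append(acc)
    return out

def compare_by_prefix_keys(pa, pb):
    """Compare two strings via their precomputed prefix-key lists."""
    n = min(len(pa), len(pb))
    if n == 0 or pa[n - 1] == pb[n - 1]:
        return len(pa) - len(pb)    # one string is a prefix of the other
    return -1 if pa[n - 1] < pb[n - 1] else 1
```

The comparison itself is cheap, but `acc` is an arbitrary-precision integer that grows with string length, which is where the unbounded key size and the memory cost come from.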
Is there a way to improve that approach? Or am I just going about it entirely wrong, and there's a much better means of sorting strings with arbitrary collations?