I need to sort an array of strings, where elements are compared lexicographically as sequences of code point values, so that, for example, "Z" < "a" < "\udabc" < "�" < "".

  1. Is there a more efficient way of comparing strings, other than manually iterating over both of them and comparing the corresponding code points?
  2. What if it is guaranteed that the strings don't have any surrogate code points (but may have surrogate pairs, so "�" < "" should still hold)? Is there a more efficient procedure for this special case?

Note: there are many answers on StackOverflow explaining how to sort strings, but they either use the localeCompare order or the order defined by JavaScript comparison operators (which compare strings as sequences of UTF-16 code units). I am not interested in either of those.
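
To make the difference between the two orders concrete, here is a minimal sketch. The characters U+FFFD and U+1F600 are picked purely for illustration and are not necessarily the ones from the examples above.

```javascript
'use strict';

// Illustrative characters (assumed, not taken from the question):
const bmp = '\uFFFD';        // U+FFFD, one code unit: 0xFFFD
const astral = '\u{1F600}';  // U+1F600, surrogate pair: 0xD83D 0xDE00

// Relational operators compare UTF-16 code units: 0xFFFD > 0xD83D.
const byCodeUnits = bmp < astral;

// Comparing the code points themselves: 0xFFFD < 0x1F600.
const byCodePoints = bmp.codePointAt(0) < astral.codePointAt(0);

console.log(byCodeUnits, byCodePoints); // false true
```

The first result is the order I do not want; the second is the one I am after.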

abacabadabacaba
  • What about using `charCodeAt` to convert them to numbers and sorting by that? – evolutionxbox Nov 27 '21 at 13:06
  • @evolutionxbox The strings can be of arbitrary length, single-character strings are only used as examples. – abacabadabacaba Nov 27 '21 at 13:09
  • I wonder if there might be a way to use the [Unicode Codepoint Collation](https://www.w3.org/2005/xpath-functions/collation/codepoint/) for a `Intl` comparison function – Bergi Nov 27 '21 at 15:01

1 Answer

How to sort strings in JavaScript by code point values?


This turns out to be a surprisingly difficult problem. Here's a proof-of-concept (POC) implementation:

'use strict';

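// Compares two strings lexicographically by code point value.
// Returns a negative number, zero, or a positive number.
function compareCodePoints(s1, s2) {
    const len = Math.min(s1.length, s2.length);
    // i is a UTF-16 code unit index shared by both strings. It stays valid
    // for both because every code point before it compared equal.
    let i = 0;
    // c1 itself is unused; the loop only bounds the number of iterations.
    for (const c1 of s1) {
        if (i >= len) {
            break;
        }
        const cp1 = s1.codePointAt(i);
        const cp2 = s2.codePointAt(i);
        const order = cp1 - cp2;
        if (order !== 0) {
            return order;
        }
        i++;
        // A code point above U+FFFF occupies two code units (a surrogate pair).
        if (cp1 > 0xFFFF) {
            i++;
        }
    }
    // One string is a prefix of the other: the shorter one sorts first.
    return s1.length - s2.length;
}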

let s =[];
let s1 = "abcz";
let s2 = "abcz";

s = [s1, s2];
console.log(s);
s.sort(compareCodePoints);
console.log(s);

console.log()

s = [s2, s1];
console.log(s);
s.sort(compareCodePoints);
console.log(s);

console.log()

s1 = "a";
s2 = "";

console.log([s1, s2]);
console.log(compareCodePoints(s1, s2));
console.log([s2, s1]);
console.log(compareCodePoints(s2, s1));

$ node codepoint.poc.js
[ 'abcz', 'abcz' ]
[ 'abcz', 'abcz' ]

[ 'abcz', 'abcz' ]
[ 'abcz', 'abcz' ]

[ 'a', '' ]
1
[ '', 'a' ]
-1
$
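
For the special case in the second question (no lone surrogates, only well-formed pairs), one possible shortcut is to compare raw UTF-16 code units and adjust only the first pair that differs. The sketch below is my assumption of how that could look, not a benchmarked solution; `fixUp` and `compareWellFormed` are hypothetical names. It still iterates manually, just over code units instead of code points, and it gives wrong results if a string contains a lone surrogate.

```javascript
'use strict';

// Hypothetical helper: remaps a code unit so that surrogates (0xD800-0xDFFF)
// sort after the rest of the BMP (0xE000-0xFFFF), as astral code points do.
function fixUp(unit) {
    if (unit < 0xD800) {
        return unit;
    }
    return unit >= 0xE000 ? unit - 0x800 : unit + 0x2000;
}

// Assumes both strings are well-formed UTF-16 (no lone surrogates).
function compareWellFormed(s1, s2) {
    const len = Math.min(s1.length, s2.length);
    for (let i = 0; i < len; i++) {
        const u1 = s1.charCodeAt(i);
        const u2 = s2.charCodeAt(i);
        if (u1 !== u2) {
            // Only the first differing code unit decides the order.
            return fixUp(u1) - fixUp(u2);
        }
    }
    // One string is a prefix of the other: the shorter one sorts first.
    return s1.length - s2.length;
}

console.log(compareWellFormed('\uFFFD', '\u{1F600}') < 0); // true
```

The remapping works because, in well-formed strings, the first differing code units are either both the start of a code point or both low surrogates following the same high surrogate.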
rocka2q
  • Your code does `for (const c1 of s1)` but `c1` is not used anywhere. Also, `compareCodePoints("a", "")` returns NaN, which Array.sort treats like 0. – abacabadabacaba Nov 27 '21 at 18:04
  • @abacabadabacaba: I've fixed the `NaN` bug. It's a POC so there may be other corner cases. The `for (const c1 of s1)` construct is used to ensure correct iteration over UTF-16. – rocka2q Nov 27 '21 at 18:38
  • OP asked for "*other than manually iterating over both of them and comparing the corresponding code points*" – Bergi Nov 27 '21 at 21:39
  • @Bergi: The OP question title, "How to sort strings in JavaScript by code point values?" See my answer for an example of sorting strings in JavaScript by code point values. An OP question in full, "Is there a more efficient way of comparing strings, other than manually iterating over both of them and comparing the corresponding code points?" The answer is no. See the [Unicode Standard](http://www.unicode.org/versions/latest/) for details. – rocka2q Nov 28 '21 at 00:47
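
Following up on the `for...of` remark in the comments: string iteration already yields one code point per step (a lone surrogate comes out as its own element), so a comparator can walk both strings through their iterators without any index bookkeeping. This is only a sketch under that assumption; `compareByIterators` is a hypothetical name and it is not claimed to be faster than the POC.

```javascript
'use strict';

function compareByIterators(s1, s2) {
    const it1 = s1[Symbol.iterator]();
    const it2 = s2[Symbol.iterator]();
    while (true) {
        const r1 = it1.next();
        const r2 = it2.next();
        if (r1.done || r2.done) {
            // The string that runs out of code points first sorts first;
            // if both run out together, the strings are equal.
            return (r1.done ? 0 : 1) - (r2.done ? 0 : 1);
        }
        // Each iterator value holds exactly one code point.
        const order = r1.value.codePointAt(0) - r2.value.codePointAt(0);
        if (order !== 0) {
            return order;
        }
    }
}

const sorted = ['\u{1F600}', '\uFFFD', '\uDABC', 'a', 'Z'].sort(compareByIterators);
console.log(sorted.map(s => s.codePointAt(0).toString(16)));
// [ '5a', '61', 'dabc', 'fffd', '1f600' ]
```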