
Today, I stumbled onto a weird issue with the JavaScript / ECMAScript Internationalization API for which I can't find a suitable explanation anywhere. I get different results when comparing two specific characters, the forward slash (/) and the underscore (_), using:

  1. plain-vanilla / traditional UTF-16 based comparison
  2. the Intl.Collator.prototype.compare() method

The Plain / Traditional UTF-16 based comparison

// Vanilla JavaScript comparator
const cmp = (a, b) => a < b ? -1 : a > b ? 1 : 0;

console.log(cmp('/', '_'));
// Output: -1

// When sorting
const result = ['/', '_'].sort(cmp);

console.log(result);
// Output: ['/', '_']

The Intl.Collator.prototype.compare() method

const collator = new Intl.Collator('en', {
  sensitivity: 'base',
  numeric: true
});

console.log(collator.compare('/', '_'));
// Output: 1

// When sorting
const result = ['/', '_'].sort(collator.compare);

console.log(result);
// Output: ['_', '/']

Questions

Why do both techniques yield different results? Is this a bug in the ECMAScript implementation? What am I missing / failing to understand here? Are there other such character combinations which would yield different results for the English (en) language / locale?

Edit 2021-10-01

As @t-j-crowder pointed out, I replaced all occurrences of "ASCII" with "UTF-16".

akaustav

1 Answer


In general

When you use < and > on strings, they're compared according to their UTF-16 code unit values (not ASCII, but ASCII overlaps with those values for many common characters). This is, to put it mildly, problematic. For instance, ask the French if "z" < "é" should really be true (indicating that z comes before é):

console.log("z" < "é"); // true?!?!

When you use Intl.Collator.prototype.compare, it uses an appropriate collation (loosely, ordering) for your locale according to the options you provide. That is likely to be different from the results for UTF-16 code unit values in many cases. For instance, even in an en locale, Collator returns the more reasonable result that z comes after é:

console.log(new Intl.Collator("en").compare("z", "é")); // 1
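
To see the same contrast on a whole list rather than a single pair, here is a minimal sketch. The helper names byCodeUnit and byCollation and the sample words are made up for illustration; only Intl.Collator and Array.prototype.sort are real APIs. The é is written as the escape \u00e9 so the example does not depend on how the source file is normalized.

```javascript
// Hypothetical helpers for illustration only.
// Compare by UTF-16 code unit values, as < and > do.
const byCodeUnit = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
// Compare by the collation of the 'en' locale (compare is a bound function).
const byCollation = new Intl.Collator('en').compare;

const words = ['zebra', '\u00e9clair', 'apple'];

// Copy before sorting, because sort() mutates the array in place.
const codeUnitOrder = [...words].sort(byCodeUnit);
const collatedOrder = [...words].sort(byCollation);

console.log(codeUnitOrder); // é (0xE9) has a higher code unit than z (0x7A), so it sorts last
console.log(collatedOrder); // é sorts with e, between a and z
```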

_ and / specifically

I can't tell you specifically why _ and / have a different order from their UTF-16 code units in the en locale you're using (and also the one that I'm using), whether it's en-US, en-GB, or something else. But it's not surprising to find that their order differs between ASCII and Unicode collation. (Remember, the UTF-16 code unit values for _ and / come from their ASCII values.)
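
For reference, the code unit values involved can be inspected directly. This is only a small sketch; the variable names slashCode and underscoreCode are made up for illustration.

```javascript
// UTF-16 code unit values, which match the ASCII codes for these characters.
const slashCode = '/'.charCodeAt(0);      // 47 (0x2F)
const underscoreCode = '_'.charCodeAt(0); // 95 (0x5F)

console.log(slashCode, underscoreCode);
// The < operator on single-character strings agrees with the numeric comparison.
console.log(slashCode < underscoreCode, '/' < '_');
```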

ASCII's order was carefully designed in the early 1960s (there's a PDF that goes into wonderful detail about it), but largely without respect to linguistic ordering other than the ordering of A-Z and 0-9. / was in the original ASCII from 1963. _ wasn't added until 1967, in one of the available positions which was higher numerically than /. There's probably no more significant reason than that why _ is later/higher (numerically) than / in ASCII.

Unicode's collation order was carefully designed in the 1990s (and on through to today) with different goals (including linguistic ones), design requirements, and design constraints. As far as I can tell (I'm not a Unicode expert), Unicode's collation is described by UTS #10 (the Unicode Collation Algorithm) and Part 5 of UTS #35 (the collation part of the Locale Data Markup Language specification). I haven't found a specific rationale for why _ is before / in the root collation (en uses the root collation). I'm sure it's in there somewhere. I did notice that one aspect of it seems to be grouping by category. The category of _ is "Connector punctuation" while the category of / is "Other punctuation." Perhaps that has something to do with why / is later than _.
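
The two general categories mentioned above can be checked from JavaScript with Unicode property escapes in a regular expression. This is only a sketch: the helper names are made up, and it shows the categories themselves, not that the categories are what drives the collation order (that part remains a guess).

```javascript
// Hypothetical helpers: test a single character against a Unicode general category.
// \p{Pc} is "Connector punctuation", \p{Po} is "Other punctuation"; both need the u flag.
const isConnectorPunctuation = (ch) => /^\p{Pc}$/u.test(ch);
const isOtherPunctuation = (ch) => /^\p{Po}$/u.test(ch);

console.log(isConnectorPunctuation('_')); // true
console.log(isOtherPunctuation('/'));     // true
console.log(isConnectorPunctuation('/')); // false
```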

But the fundamental answer is: They differ because ASCII's ordering and Unicode collation were designed with different constraints and requirements.
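
As for whether other character combinations behave the same way, one way to find out is to check every pair yourself. The sketch below walks the printable ASCII characters that are not letters or digits and collects the pairs whose order under an en collator is the reverse of their code unit order. All names are made up for illustration, and the exact list may vary with the ICU / CLDR data your JavaScript engine ships, so no output is shown.

```javascript
// Printable ASCII is the range 0x20 (space) through 0x7E (~): 95 characters.
const printable = Array.from({ length: 95 }, (unused, i) => String.fromCharCode(32 + i));
// Keep only punctuation, symbols and the space.
const nonAlnum = printable.filter((ch) => !/[A-Za-z0-9]/.test(ch));

const enCollator = new Intl.Collator('en');
const disagreements = [];

for (let i = 0; i < nonAlnum.length; i++) {
  for (let j = i + 1; j < nonAlnum.length; j++) {
    const a = nonAlnum[i];
    const b = nonAlnum[j];
    // a comes before b by code unit (i < j); record the pair if collation says otherwise.
    if (enCollator.compare(a, b) > 0) {
      disagreements.push([a, b]);
    }
  }
}

console.log(`${disagreements.length} pairs sort differently`);
console.log(disagreements.map(([a, b]) => `${a} ${b}`).join('\n'));
```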

T.J. Crowder
  • @t-j-crowder - Thank you for clarifying it is UTF-16 instead of ASCII. Your explanation makes perfect sense for actual language characters used to make up words - like `z` and `é`. However, I think I am looking for an explanation for how the comparison is done between punctuation characters / special characters / symbol characters - specifically the underscore (`_`) and forward slash (`/`) characters for the `en-US` locale. – akaustav Oct 02 '21 at 00:05
  • @akaustav - I doubt you're going to find a smoking gun for why ASCII had `/` before `_` but Unicode has `_` before `/`. I've added a bit to the answer above to point you in a direction, but fundamentally, it really doesn't matter. They're different because the two orderings were designed at different times, for different purposes, with different goals, requirements, and constraints. :-) – T.J. Crowder Oct 02 '21 at 09:11