Sequence of logical OR in ES6/Unicode regular expression in Chrome ✗ vs Firefox ✓

Question

Consider the following Unicode-heavy regular expression (emoji standing in for non-ASCII and extra-BMP characters):

'🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋/ug)

Firefox returns [ "🍤", "🍦", "🍋", "🍋", "🍦", "🍤" ].

Chrome 52.0.2743.116 and Node 6.4.0 both return null! It makes no difference whether I put the string in a variable and call str.match(…), or build a RegExp object via new RegExp('🍤|🍦|🍋', 'gu').

(Chrome is ok with just ORing two sequences: '🍤🍦'.match(/🍤|🍦/ug) is ok. It’s also ok with non-Unicode: 'aakkzzkkaa'.match(/aa|kk|zz/ug) works.)

Am I doing something wrong? Is this a Chrome bug? The ECMAScript compatibility table says I should be ok with Unicode regexps.

(PS: The three emoji used in this example are just stand-ins. In my application, they’ll be arbitrary but distinct strings. But I wonder if the fact that '🍤🍦🍋🍋🍦🍤'.match(/[🍤🍦🍋]/ug) works in Chrome is relevant?)

Update: Marked fixed on 12 April 2017 in Chromium and downstream (including Chrome and Node).

Maybe I'm just conservative, but this would be easier to read with `foo`, `bar`, and `baz` or `A`, `B`, and `C`. Plus a lot of fonts still don't do all the emojis, so if someone is missing two of them they will see them both as a square -- or worse all three. — Captain Man, Aug 25 '16 at 18:44
@CaptainMan the world speaks many languages, many of which are written with non-ASCII or (gasp!) extra-BMP characters. I’m using emoji as a standin for those characters. (Also I indicate in the post that the same example works with ASCII, so it’s a Unicode problem.) Updating title to emphasize Unicode. — Ahmed Fasih, Aug 25 '16 at 18:45
I see now part of the point was for unicode (missed it at first). I still think more "vanilla" unicode characters would be better than emojis. — Captain Man, Aug 25 '16 at 18:47
Note that `'🍤🍦🍋🍋🍦🍤'.match(/[🍤🍦🍋]/ug)` works in Chrome. Alternation *does* break the regex for some reason. — Wiktor Stribiżew, Aug 25 '16 at 18:48
Folks. I want to use this code with, say, Tangut characters, new in Unicode 9. I think if this code breaks on emoji, it’ll break in my application. — Ahmed Fasih, Aug 25 '16 at 18:49
@WiktorStribiżew thanks. In my application, I’ll be searching for multi-character strings, so I can’t use `[a-z]`-like character sequences, but maybe the fact that that works but ORs don’t will give someone a hint as to the solution. — Ahmed Fasih, Aug 25 '16 at 18:51
Check that regex is actually reading for the correct number of distinct characters. I think I've seen regex read emoji / unicode characters as two separate characters. I'll try to find a reference. — Jecoms, Aug 25 '16 at 18:55
@Jecoms definitely—JavaScript’s UTF-16 representation means `'🍤🍦🍋'.length` evaluates to 6, not 3, and other shenanigans. However, I thought ES2015 and that trailing `u` regexp modifier would automagic everything into being Unicode-friendly ([reference](https://babeljs.io/docs/learn-es2015/#unicode)). Looking into this further… — Ahmed Fasih, Aug 25 '16 at 18:56
@kennytm thanks!!! That increases my confidence that it’s a bug in Chrome-land. I’ll use that for now and file a bug report. — Ahmed Fasih, Aug 25 '16 at 19:01
Filed… https://bugs.chromium.org/p/chromium/issues/detail?id=641091 — Ahmed Fasih, Aug 25 '16 at 19:12
@AhmedFasih, without the "u"-flag it does also work in chrome *(52.0.2743.116)* for me — Thomas, Aug 25 '16 at 19:23
@Thomas well I’ll be a monkey’s uncle, you’re absolutely right. Thanks! — Ahmed Fasih, Aug 25 '16 at 19:28
unless you use a multiplier: `''.match(/|{2}|/g)` -> null. `{1}` and `{1,}` seem to work; I assume they are translated into `?` and `+`. I assume that without the "u" flag `🍦{2}` is interpreted as `\ud83c\udf66{2}`, which would explain the behaviour. *SO doesn't print my regex right; when editing, it ends with `{fruit}/g)`* — Thomas, Aug 25 '16 at 19:36
This is certainly the most amusing question I've read all day. — zzzzBov, Aug 25 '16 at 20:04
ES6 2014: According to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp), Chrome and Firefox support _code points_ with the `//u` flag. Using code point units implies that internally, the regex _and_ target source are converted to either all UTF-8 or all UTF-32. In the regex, the code point construct is `\u{xxxxxx}`. Literals, it says, should be ok to put in classes _and_ can be quantified outside classes (see [here](https://mathiasbynens.be/notes/es6-unicode-regex)). Until it's worked out, don't use `//u`: handle the UTF-16 problems (surrogates) yourself. — , Aug 26 '16 at 15:55
Another data point: after transpiling the snippet using regexpu (or Babel/Traceur, which use regexpu) and executing it, the output matches that of Firefox. https://mothereff.in/regexpu#input=console.log(%0A++%27%F0%9F%8D%A4%F0%9F%8D%A6%F0%9F%8D%8B%F0%9F%8D%8B%F0%9F%8D%A6%F0%9F%8D%A4%27.match(/%F0%9F%8D%A4%7C%F0%9F%8D%A6%7C%F0%9F%8D%8B/ug)%0A) — Mathias Bynens, Aug 30 '16 at 09:02
@Thomas Your assumption is correct. See [https://mathiasbynens.be/notes/es6-unicode-regex#impact-quantifiers](https://mathiasbynens.be/notes/es6-unicode-regex#impact-quantifiers). — Mathias Bynens, Aug 30 '16 at 09:02

georg · Accepted Answer · 2016-08-25T20:13:25.627

Without the u flag, your regexp works, and this is no wonder, since in the BMP (=no "u") mode it compares 16-bit "units" to 16-bit "units", that is, a surrogate pair to another surrogate pair.
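
To make the unit-versus-code-point distinction concrete, here is a small sketch (the variable names are only illustrative, not from the question) of how one of the emoji looks in each mode:

```javascript
// The emoji lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const shrimp = '🍤'; // U+1F364

console.log(shrimp.length);                      // 2 code units
console.log(shrimp.charCodeAt(0).toString(16));  // d83c (high surrogate)
console.log(shrimp.charCodeAt(1).toString(16));  // df64 (low surrogate)
console.log(shrimp.codePointAt(0).toString(16)); // 1f364 (the code point itself)

// Without "u", "." consumes one 16-bit unit, so two dots cover the emoji.
const unitMode = /^..$/.test(shrimp);
// With "u", "." consumes one whole code point.
const pointMode = /^.$/u.test(shrimp);
console.log(unitMode, pointMode); // true true
```

This is why a pattern made only of literal characters and alternation still matches without the flag: the pattern and the subject are both sequences of the same 16-bit units.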

The behaviour in the "u" mode (which is supposed to compare code points and not units) does indeed look like a Chrome bug. In the meantime, you can enclose each alternative in a group, which seems to work fine:

m = '🍤🍦🍋🍋🍦🍤'.match(/(🍤)|(🍦)|(🍋)/ug)
console.log(m)

// note that the groups must be capturing!
// this doesn't work:

m = '🍤🍦🍋🍋🍦🍤'.match(/(?:🍤)|(?:🍦)|(?:🍋)/ug)
console.log(m)
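
Since the question mentions that the real alternatives will be arbitrary strings, the capturing-group workaround could be wrapped in a helper along these lines. This is only a sketch: `buildAlternation` and `escapeRegExp` are hypothetical names, and it assumes each input should be matched literally.

```javascript
// Hypothetical helper: builds a global Unicode alternation in which every
// alternative sits in its own capturing group (the workaround shown above).
function buildAlternation(alternatives) {
  // Escape regex syntax characters so each input string is matched literally.
  // Only syntax characters are escaped, because "u" mode rejects other escapes.
  const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const source = alternatives.map(s => `(${escapeRegExp(s)})`).join('|');
  return new RegExp(source, 'gu');
}

const re = buildAlternation(['🍤', '🍦', '🍋']);
const found = '🍤🍦🍋🍋🍦🍤'.match(re);
console.log(found);
```

Alternation tries the alternatives left to right, so if one input string is a prefix of another, it probably makes sense to list the longer one first.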

And here's a quick proof that more than two SMP alternatives are broken in the u mode:

// insert any range you like
// from https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane
var range = '11300-1137F';

range = range.split('-').map(x => parseInt(x, 16))

var chars = [];
for (var i = range[0]; i <= range[1]; i++) {
    chars.push(String.fromCodePoint(i))
}

var str = chars.join('');

while(chars.length) {
    var re = new RegExp(chars.join('|'), 'u')
    if(str.match(re))
        console.log(chars.length, re);
    chars.pop();
}

In Chrome, it only logs the last two regexes (2 and 1 alts).

Thomas · Answer 2 · 2016-08-25T20:14:31.547

Without the "u" flag it also works in Chrome (52.0.2743.116) for me, so the u flag seems to be broken.

Without the flag it works unless you use a multiplier: ''.match(/|{2}|/g) -> null. {1} and {1,} seem to work; I assume they are translated into ? and +. I assume that without the "u" flag 🍦{2} is interpreted as \ud83c\udf66{2}, which would explain the behaviour.

I just tested with (?:🍦){2}, and this seems to work right. I guess this confirms my assumption about the multiplier.
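
The quantifier behaviour described here can be checked in isolation. The following sketch (the variable names are illustrative) compares the three cases on a string of two ice creams:

```javascript
const two = '🍦🍦'; // code units: d83c df66 d83c df66

// Without "u", {2} binds only to the preceding code unit (the low surrogate).
const plain = /^🍦{2}$/.test(two);
// The same pattern does match d83c followed by df66 twice.
const loneLow = /^🍦{2}$/.test('\ud83c\udf66\udf66');

// Grouping the pair makes the quantifier cover the whole character.
const grouped = /^(?:🍦){2}$/.test(two);

// With "u", the quantifier applies to the code point as written.
const unicode = /^🍦{2}$/u.test(two);

console.log(plain, loneLow, grouped, unicode); // false true true true
```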

Here is a quick fix for that:

// a utility I usually keep in my code
var replace = (pattern, replacement) => value => String(value).replace(pattern, replacement);

var fixRegexSource = replace(
    /[\ud800-\udbff][\udc00-\udfff]/g, 
    // "(?:$&)" // not sure whether this might still be buggy,
    // which is why I convert each pair into the \uXXXX escape syntax;
    // that can't be misinterpreted
    c => `(?:\\u${c.charCodeAt(0).toString(16)}\\u${c.charCodeAt(1).toString(16)})`
);

var fixRegex = regex => new RegExp(
    fixRegexSource(regex.source), 
    regex.flags.replace("u", "")
);

Sorry, I didn't come up with better function names.
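
As a usage sketch, the helpers are restated below in condensed form so that the snippet runs on its own, then applied to the regex from the question:

```javascript
// Condensed restatement of fixRegexSource / fixRegex from above.
const fixRegexSource = source => source.replace(
  /[\ud800-\udbff][\udc00-\udfff]/g,
  c => `(?:\\u${c.charCodeAt(0).toString(16)}\\u${c.charCodeAt(1).toString(16)})`
);
const fixRegex = regex =>
  new RegExp(fixRegexSource(regex.source), regex.flags.replace('u', ''));

const fixed = fixRegex(/🍤|🍦|🍋/ug);
console.log(fixed.source); // (?:\ud83c\udf64)|(?:\ud83c\udf66)|(?:\ud83c\udf4b)
console.log(fixed.flags);  // g

const matches = '🍤🍦🍋🍋🍦🍤'.match(fixed);
console.log(matches.length); // 6
```

Dropping the flag changes what `.`, `\u{…}` escapes, and character classes containing astral characters mean, so this rewrite is probably only safe for patterns built from literal characters, alternation, and quantifiers.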
