13

Consider the following Unicode-heavy regular expression (emoji standing in for non-ASCII and extra-BMP characters):

'🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋/ug)

Firefox returns [ "🍤", "🍦", "🍋", "🍋", "🍦", "🍤" ].

Chrome 52.0.2743.116 and Node 6.4.0 both return null! It doesn’t seem to matter whether I put the string in a variable and call str.match(…), or build a RegExp object via new RegExp('🍤|🍦|🍋', 'gu').

(Chrome is ok with just ORing two sequences: '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦/ug) is ok. It’s also ok with non-Unicode: 'aakkzzkkaa'.match(/aa|kk|zz/ug) works.)

Am I doing something wrong? Is this a Chrome bug? The ECMAScript compatibility table says I should be ok with Unicode regexps.

(PS: The three emoji used in this example are just stand-ins. In my application, they’ll be arbitrary but distinct strings. But I wonder if the fact that '🍤🍦🍋🍋🍦🍤'.match(/[🍤🍦🍋]/ug) works in Chrome is relevant?)


Update: Marked fixed on 12 April 2017 in Chromium and downstream (including Chrome and Node).

Ahmed Fasih
  • Maybe I'm just conservative, but this would be easier to read with `foo`, `bar`, and `baz` or `A`, `B`, and `C`. Plus a lot of fonts still don't do all the emojis, so if someone is missing two of them they will see them both as a square -- or worse all three. – Captain Man Aug 25 '16 at 18:44
  • @CaptainMan the world speaks many languages, many of which are written with non-ASCII or (gasp!) extra-BMP characters. I’m using emoji as a standin for those characters. (Also I indicate in the post that the same example works with ASCII, so it’s a Unicode problem.) Updating title to emphasize Unicode. – Ahmed Fasih Aug 25 '16 at 18:45
  • I see now part of the point was for unicode (missed it at first). I still think more "vanilla" unicode characters would be better than emojis. – Captain Man Aug 25 '16 at 18:47
  • 3
    Note that `'🍤🍦🍋🍋🍦🍤'.match(/[🍤🍦🍋]/ug)` works in Chrome. Alternation *does* break the regex for some reason. – Wiktor Stribiżew Aug 25 '16 at 18:48
  • 1
    Folks. I want to use this code with, say, Tangut characters, new in Unicode 9. I think if this code breaks on emoji, it’ll break in my application. – Ahmed Fasih Aug 25 '16 at 18:49
  • @WiktorStribiżew thanks. In my application, I’ll be searching for multi-character strings, so I can’t use `[a-z]`-like character sequences, but maybe the fact that that works but ORs don’t will give someone a hint as to the solution. – Ahmed Fasih Aug 25 '16 at 18:51
  • Check that regex is actually reading for the correct number of distinct characters. I think I've seen regex read emoji / unicode characters as two separate characters. I'll try to find a reference. – Jecoms Aug 25 '16 at 18:55
  • @Jecoms definitely—JavaScript’s UTF-16 representation means `''.length` evaluates to 6, not 3, and other shenanigans. However, I thought ES2015 and that trailing `u` regexp modifier would automagic everything into being Unicode-friendly ([reference](https://babeljs.io/docs/learn-es2015/#unicode)). Looking into this further… – Ahmed Fasih Aug 25 '16 at 18:56
  • 4
    `'🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋{1}/ug)` works. I say it's a bug. – kennytm Aug 25 '16 at 18:58
  • @kennytm thanks!!! That increases my confidence that it’s a bug in Chrome-land. I’ll use that for now and file a bug report. – Ahmed Fasih Aug 25 '16 at 19:01
  • 4
    Filed… https://bugs.chromium.org/p/chromium/issues/detail?id=641091 – Ahmed Fasih Aug 25 '16 at 19:12
  • 1
    @AhmedFasih, without the "u"-flag it does also work in chrome *(52.0.2743.116)* for me – Thomas Aug 25 '16 at 19:23
  • @Thomas well I’ll be a monkey’s uncle, you’re absolutely right. Thanks! – Ahmed Fasih Aug 25 '16 at 19:28
  • 1
    Unless you use a multiplier: `''.match(/|{2}|/g)` -> null. `{1}` and `{1,}` seem to work; I assume they are translated into `?` and `+`. I assume that without the "u" flag `{2}` is interpreted as `\ud83c\udf66{2}`, which would explain the behaviour. *SO doesn't print my regex right; when editing, it ends with `{fruit}/g)`* – Thomas Aug 25 '16 at 19:36
  • 1
    This is certainly the most amusing question I've read all day. – zzzzBov Aug 25 '16 at 20:04
  • 1
    ES6 2014: According to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp), Chrome and Firefox support _code points_ with the `//u` flag. Using code point units implies that internally, the regex _and_ target source are converted to either all UTF-8 or all UTF-32. In the regex, the code point construct is `\u{xxxxxx}`. Literals, it says, should be ok to put in classes _and_ can be quantified outside classes (see [here](https://mathiasbynens.be/notes/es6-unicode-regex)). Until it's worked out, don't use `//u`: handle the UTF-16 problems (surrogates) yourself. –  Aug 26 '16 at 15:55
  • 1
    Another data point: after transpiling the snippet using regexpu (or Babel/Traceur, which use regexpu) and executing it, the output matches that of Firefox. https://mothereff.in/regexpu#input=console.log(%0A++%27%F0%9F%8D%A4%F0%9F%8D%A6%F0%9F%8D%8B%F0%9F%8D%8B%F0%9F%8D%A6%F0%9F%8D%A4%27.match(/%F0%9F%8D%A4%7C%F0%9F%8D%A6%7C%F0%9F%8D%8B/ug)%0A) – Mathias Bynens Aug 30 '16 at 09:02
  • @Thomas Your assumption is correct. See [https://mathiasbynens.be/notes/es6-unicode-regex#impact-quantifiers](https://mathiasbynens.be/notes/es6-unicode-regex#impact-quantifiers). – Mathias Bynens Aug 30 '16 at 09:02

2 Answers

3

Without the u flag, your regexp works, which is no surprise: in the BMP (no "u") mode it compares 16-bit code units to 16-bit code units, that is, one surrogate pair to another surrogate pair.
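
To make the code-unit versus code-point distinction concrete, here is a minimal sketch. It assumes an ES2015-capable engine; U+1F366 is only an illustrative astral character, and the variable names are hypothetical, not taken from the question.

```javascript
// Illustrative astral (non-BMP) character: U+1F366, written as a code point escape.
const softIce = '\u{1F366}';

// A JavaScript string is a sequence of UTF-16 code units, so this is 2, not 1.
const unitCount = softIce.length;

// The string iterator walks code points, so spreading yields one element.
const pointCount = [...softIce].length;

// Without "u", "." consumes a single code unit, so it cannot span the whole pair.
const dotWithoutU = /^.$/.test(softIce);

// With "u", "." consumes a whole code point.
const dotWithU = /^.$/u.test(softIce);

// Without "u", the character behaves like its two surrogates written out.
const asSurrogates = /^\ud83c\udf66$/.test(softIce);

console.log(unitCount, pointCount, dotWithoutU, dotWithU, asSurrogates);
// → 2 1 false true true
```

This is why a plain alternation of astral characters still matches without the flag: each alternative is simply a two-unit literal.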

The behaviour in the "u" mode (which is supposed to compare code points, not code units) does look like a Chrome bug. In the meantime, you can enclose each alternative in a group, which seems to work fine:

m = '🍤🍦🍋🍋🍦🍤'.match(/(🍤)|(🍦)|(🍋)/ug)
console.log(m)

// note that the groups must be capturing!
// this doesn't work:

m = '🍤🍦🍋🍋🍦🍤'.match(/(?:🍤)|(?:🍦)|(?:🍋)/ug)
console.log(m)
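
Since the question mentions that the real alternatives will be arbitrary strings, the capturing-group workaround could be generated rather than typed by hand. The following is only a sketch under that assumption: `escapeRegExp` and `buildGroupedAlternation` are hypothetical helper names, and whether the result sidesteps the Chrome 52 bug for every input has not been verified here. On a conforming engine it returns the same six matches as the hand-written pattern.

```javascript
// Hypothetical helper: escape regex syntax characters so each word is matched literally.
// Only syntax characters are escaped, which keeps the escapes valid in "u" mode.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap every alternative in a *capturing* group, as the
// workaround above requires, and join the groups with "|".
function buildGroupedAlternation(words) {
  const source = words.map(word => `(${escapeRegExp(word)})`).join('|');
  return new RegExp(source, 'gu');
}

// The three stand-in emoji from the question, written as code point escapes.
const words = ['\u{1F364}', '\u{1F366}', '\u{1F34B}'];
const input = '\u{1F364}\u{1F366}\u{1F34B}\u{1F34B}\u{1F366}\u{1F364}';

const groupedRe = buildGroupedAlternation(words);
const matches = input.match(groupedRe);

console.log(groupedRe.source);
console.log(matches);
```

With the `g` flag, `match` returns only the full matches, so the extra capturing groups do not change the shape of the result.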

And here's a quick proof that more than two SMP alternatives are broken in the u mode:

// insert any code point range
// from https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane
var range = '11300-1137F';

range = range.split('-').map(x => parseInt(x, 16))

var chars = [];
for (var i = range[0]; i <= range[1]; i++) {
    chars.push(String.fromCodePoint(i))
}

var str = chars.join('');

while(chars.length) {
    var re = new RegExp(chars.join('|'), 'u')
    if(str.match(re))
        console.log(chars.length, re);
    chars.pop();
}

In Chrome, it only logs the last two regexes (2 and 1 alts).

georg
2

Without the "u" flag it also works in Chrome (52.0.2743.116) for me.

Well, the "u" flag seems to be broken.

Unless you use a multiplier: ''.match(/|{2}|/g) -> null. {1} and {1,} seem to work; I assume they are translated into ? and +. I assume that without the "u" flag {2} is interpreted as \ud83c\udf66{2}, which would explain the behaviour.

I just tested with (?:🍦){2}, and this seems to work right. I guess this confirms my assumption about the multiplier.
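
The quantifier behaviour described here can be checked in isolation. This is a small sketch, assuming U+1F366 as the test character (it is the one whose surrogates are \ud83c\udf66); the variable names are hypothetical.

```javascript
// U+1F366 is stored as the surrogate pair \ud83c \udf66.
const scoop = '\u{1F366}';
const twoScoops = scoop + scoop;

// No "u" flag: {2} binds to the last code unit only, i.e. \ud83c\udf66{2}.
const bareQuantifier = new RegExp(scoop + '{2}').test(twoScoops);

// The same pattern does match a high surrogate followed by two low surrogates.
const loneSurrogates = new RegExp(scoop + '{2}').test('\ud83c\udf66\udf66');

// "u" flag: {2} binds to the whole code point.
const unicodeQuantifier = new RegExp(scoop + '{2}', 'u').test(twoScoops);

// No "u" flag, but an explicit group: {2} binds to the whole pair.
const groupedQuantifier = new RegExp('(?:' + scoop + '){2}').test(twoScoops);

console.log(bareQuantifier, loneSurrogates, unicodeQuantifier, groupedQuantifier);
// → false true true true
```

Quantifiers `{1}` and `{1,}` applied to the trailing surrogate happen to give the same matches as if they applied to the whole pair, which fits the observation that they appear to work.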

Here is a quick fix for that:

// a utility I usually have in my code
var replace = (pattern, replacement) => value => String(value).replace(pattern, replacement);

var fixRegexSource = replace(
    /[\ud800-\udbff][\udc00-\udfff]/g, 
    //"(?:$&)" // not sure whether this might still be buggy;
    // that's why I convert it into the \uXXXX escape syntax,
    // which can't be misinterpreted
    c => `(?:\\u${c.charCodeAt(0).toString(16)}\\u${c.charCodeAt(1).toString(16)})`
);

var fixRegex = regex => new RegExp(
    fixRegexSource(regex.source), 
    regex.flags.replace("u", "")
);

sry, didn't come up with better function-names
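
For illustration, here is how the helper above might be used. The snippet restates it compactly so that it runs on its own; `wrapSurrogatePairs` is a hypothetical name for the same transformation, and the input regex is the one from the question, built from code point escapes.

```javascript
// Compact restatement of the helper above so this snippet is self-contained.
// Each literal surrogate pair in the source becomes (?:\uXXXX\uXXXX).
const wrapSurrogatePairs = source =>
  source.replace(
    /[\ud800-\udbff][\udc00-\udfff]/g,
    pair => `(?:\\u${pair.charCodeAt(0).toString(16)}\\u${pair.charCodeAt(1).toString(16)})`
  );

// Rebuild the regex from the rewritten source, dropping the "u" flag.
const fixRegex = regex =>
  new RegExp(wrapSurrogatePairs(regex.source), regex.flags.replace('u', ''));

// The regex and test string from the question.
const original = new RegExp('\u{1F364}|\u{1F366}|\u{1F34B}', 'gu');
const text = '\u{1F364}\u{1F366}\u{1F34B}\u{1F34B}\u{1F366}\u{1F364}';

const fixed = fixRegex(original);
const found = text.match(fixed);

console.log(fixed.flags);   // the "u" flag has been removed
console.log(found.length);  // number of emoji matched
```

Note that this only rewrites literal astral characters. A pattern that relies on other "u"-mode features, such as `\u{...}` escapes, `.` matching a whole code point, or astral characters inside a character class, would change meaning once the flag is dropped.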

Thomas