
I wrote the following regex: /\D(?!.*\D)|^-?|\d+/g

I think it should work this way:

\D(?!.*\D)    # match the last non-digit
|             # or
^-?           # match the start of the string with optional literal '-' character
|             # or
\d+           # match digits

But it doesn't:

var arrTest = '12,345,678.90'.match(/\D(?!.*\D)|^-?|\d+/g);
console.log(arrTest);

var test = arrTest.join('').replace(/[^\d-]/, '.');
console.log(test);

However, when I try it with the PCRE (PHP) flavour online at Regex101, it works as I described.

I don't know whether my understanding of how it should work is wrong, or whether some part of the pattern is not supported by the JavaScript regex flavour.

Wiktor Stribiżew
Washington Guedes

2 Answers


JS works differently than PCRE. The point is that the JS regex engine does not handle zero-length matches well: the index is simply incremented manually, so the character right after a zero-length match is skipped as a possible match start. The ^-? can match an empty string, and it does match at the start of 12,345,678.90, which causes the 1 to be skipped.

If we have a look at the String#match algorithm in the ECMAScript 5.1 specification, we will see that a call to match with a global regex increments the regex object's lastIndex after a zero-length match is found:

  8. Else, global is true
    a. Call the [[Put]] internal method of rx with arguments "lastIndex" and 0.
    b. Let A be a new array created as if by the expression new Array() where Array is the standard built-in constructor with that name.
    c. Let previousLastIndex be 0.
    d. Let n be 0.
    e. Let lastMatch be true.
    f. Repeat, while lastMatch is true
        i. Let result be the result of calling the [[Call]] internal method of exec with rx as the this value and argument list containing S.
        ii. If result is null, then set lastMatch to false.
        iii. Else, result is not null
            1. Let thisIndex be the result of calling the [[Get]] internal method of rx with argument "lastIndex".
            2. If thisIndex = previousLastIndex then
                a. Call the [[Put]] internal method of rx with arguments "lastIndex" and thisIndex+1.
                b. Set previousLastIndex to thisIndex+1.

So, the matching process runs steps 8a through 8e to initialize the auxiliary structures, and then enters the loop at 8f (repeated while lastMatch is true). The internal exec call matches the empty string at the start of the input (8f.i -> 8f.iii). As the result is not null, thisIndex is set to the regex's lastIndex after that match, and since the match was zero-length (basically, thisIndex = previousLastIndex), both lastIndex and previousLastIndex are set to thisIndex+1. This is what skips the current position after a successful zero-length match.
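
The loop from step 8f can be emulated in plain JavaScript to watch the skip happen. This is only a sketch: `matchGlobalSketch` is a hypothetical helper name, and the `else` branch and the array push are not part of the quoted excerpt, so they are filled in here as assumptions about the remaining bookkeeping.

```javascript
// Hypothetical helper that mimics the quoted String#match loop for a global regex.
function matchGlobalSketch(str, rx) {
  rx.lastIndex = 0;                 // step 8a
  var matches = [];                 // step 8b
  var previousLastIndex = 0;        // step 8c
  var result;
  // step 8f: keep calling exec until it returns null
  while ((result = rx.exec(str)) !== null) {
    var thisIndex = rx.lastIndex;   // step 8f.iii.1
    if (thisIndex === previousLastIndex) {
      // Zero-length match: bump lastIndex by one, so the character at
      // thisIndex can no longer start the next match.
      rx.lastIndex = thisIndex + 1;
      previousLastIndex = thisIndex + 1;
    } else {
      // Assumption (not in the excerpt): remember where this match ended.
      previousLastIndex = thisIndex;
    }
    matches.push(result[0]);        // assumption: collect the matched text
  }
  return matches.length ? matches : null;
}

var traced = matchGlobalSketch('12,345,678.90', /\D(?!.*\D)|^-?|\d+/g);
console.log(traced); // ["", "2", "345", "678", ".", "90"]
```

The empty first entry is the zero-length `^-?` match at index 0; because lastIndex is then forced to 1, `\d+` can only pick up `2` instead of `12`.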

You may actually use a simpler regex inside a replace method and use a callback to use appropriate replacements:

var res = '-12,345,678.90'.replace(/(\D)(?!.*\D)|^-|\D/g, function($0,$1) {
   return $1 ? "." : "";
});
console.log(res);

Pattern details:

  • (\D)(?!.*\D) - a non-digit (captured into Group 1) that is not followed by 0+ chars other than line break chars and then another non-digit
  • | - or
  • ^- - a hyphen at the string start
  • | - or
  • \D - a non-digit

Note that here you do not even have to make the hyphen at the start optional.
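
Note that the `^-` branch in the snippet above replaces a leading minus with an empty string, just like any other non-digit that is not the last one. If the sign should survive, a variant along the lines of the one the asker mentions in the comments can capture it and return it unchanged. This is a minimal sketch: `normalizeNumber` is a hypothetical name, and it assumes, as in the question, that the last non-digit is always the decimal separator.

```javascript
// Hypothetical helper: strips grouping separators, turns the last
// non-digit into '.', and keeps a leading minus sign.
function normalizeNumber(input) {
  return input.replace(/(^-)|(\D)(?!.*\D)|\D/g, function (match, sign, lastSep) {
    if (sign) return sign;        // leading '-' is put back as-is
    return lastSep ? '.' : '';    // last non-digit -> '.', others removed
  });
}

console.log(normalizeNumber('-12,345,678.90')); // -12345678.90
console.log(normalizeNumber('1.234,56'));       // 1234.56
```

The `(^-)` branch has to come first: otherwise, in an input such as `-5`, the minus would be the last non-digit and would be turned into a dot.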

Wiktor Stribiżew
  • Took me 16 minutes to add (and format) the ECMA reference with a step-by-step explanation of why this happens. Hope it explains this weird JS *global* regex matching behavior. Note that PCRE does not just bluntly increment the position after a zero-length match, it re-checks the next character correctly. – Wiktor Stribiżew Aug 09 '16 at 21:57
  • Thanks :) ... I ended up with `.replace(/(^-)|\D(?=.*\D)|(\D)/g, function($0, $1, $2) { return $1 || ($2 ? '.' : ''); });` to avoid replacing the negative sign. – Washington Guedes Aug 10 '16 at 02:22
  • If you do not need to replace that minus, why use the alternative branch at all :) ? Just remove the `(^-)|` from the pattern and remove `$2` from the callback function and adjust its body. – Wiktor Stribiżew Aug 10 '16 at 06:25

You can reorder your alternation patterns and use this in JS to make it work:

var arrTest = '12,345,678.90'.match(/\D(?!.*\D)|\d+|^-?/g);
console.log(arrTest);

var test = arrTest.join('').replace(/[^\d-]/, '.'); // [^\d-], as in the question, so a leading '-' is never turned into '.'

console.log(test);

//=> 12345678.90


This comes down to a difference between JavaScript and PHP (PCRE) regex behavior.

In Javascript:

'12345'.match(/^|.+/gm)
//=> ["", "2345"]

In PHP:

preg_match_all('/^|.+/m', '12345', $m);
print_r($m);
Array
(
    [0] => Array
        (
            [0] =>
            [1] => 12345
        )
)

So when ^ matches (a zero-length match) in JavaScript, the regex engine moves one position ahead, and anything after the alternation | can only match from the 2nd position onwards in the input.
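
To check that only an empty match of `^-?` triggers the skip, the two alternation orders can be compared side by side on the question's input. This is a small sketch; the variable names are made up for illustration.

```javascript
var emptyFirst = /\D(?!.*\D)|^-?|\d+/g;   // the question's order
var digitsFirst = /\D(?!.*\D)|\d+|^-?/g;  // reordered, as in this answer

// ^-? matches the empty string at index 0, so the "1" is lost.
var a = '12,345,678.90'.match(emptyFirst);
console.log(a); // ["", "2", "345", "678", ".", "90"]

// \d+ is tried before ^-?, so "12" is consumed first and no empty match occurs.
var b = '12,345,678.90'.match(digitsFirst);
console.log(b); // ["12", "345", "678", ".", "90"]

// With a real minus, ^-? matches one character, so nothing is skipped
// even with the question's order.
var c = '-12,345,678.90'.match(emptyFirst);
console.log(c); // ["-", "12", "345", "678", ".", "90"]
```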

anubhava
  • It is because when you match `^`, the JS regex engine moves one position ahead, and anything after the alternation `|` matches from the 2nd position in the input. Test with `'12345'.match(/^|.+/gm)`, which will give `["", "2345"]`. – anubhava Aug 09 '16 at 21:28