I thought I knew about regex... Here's the situation:

N-U0 U0-M1
M1-T9 T9-R10 R10-E19
E19-L100 L100-B

I have a String containing groups (let's call them transitions) separated by whitespace: one or more characters, which may or may not include line breaks (I treat them all the same). Each transition consists of two parts (let's call them exiting and entering) separated by a hyphen. Each part is either a single character (N for exiting, B for entering) or one specific letter followed by a number of one or more digits.

I want to run a regex match that will give me one object for each transition and then, for each object, I want access to each part of the transition by means of named capture groups.

These are the regexes I've written:

static RegExp regex = RegExp(
  r'(?<exitingN>N)|((?<exitingF>[UMTREL]{1})(?<exitingNumber>[0-9]+))-(?<enteringB>B)|((?<enteringF>[UMTREL]{1})(?<enteringNumber>[0-9]+))\s+',
);

static RegExp exitingRegex = RegExp(
  r'(?<exitingN>N)|((?<exitingF>[UMTREL]{1})(?<exitingNumber>[0-9]+))-',
);

static RegExp enteringRegex = RegExp(
  r'-(?<enteringB>B)|((?<enteringF>[UMTREL]{1})(?<enteringNumber>[0-9]+))',
);

When I run

final matchList = regex.allMatches(
  "N-U0 U0-M1\nM1-T9 T9-R10 R10-E19\nE19-L100 L100-B\n",
);

It doesn't work as I expect it to. It matches the first N, then the first U0, then the first M1, and so on until the first L100 and the B. I was expecting it to match N-U0, then U0-M1 and so on. At least matchList.elementAt(0).namedGroup("exitingN") etc works, but I wanted the exiting and the entering parts together.

I tried to add the regex inside another group and I tried both with and without ?: (to make it non-capturing), plus a few other tests, I think, but nothing worked.

Then I tested with exitingRegex only and it worked as expected, matching every exiting. However, enteringRegex didn't work. It matched every exiting and every entering except for N.

The only way I managed to make it work was to match with exitingRegex and then, for the entering, I had to first use "N-U0 U0-M1\nM1-T9 T9-R10 R10-E19\nE19-L100 L100-B\n".replaceAll(exitingRegex, "",) and then match with enteringRegex but without the leading hyphen. This way, I got the exiting and the entering separately, which I have to join later by index.

What's going on?

Thanks in advance.

GuiRitter
  • @InSync Strange. This is what was matched with your change: `-U0`, `-M1`, `-T9`, `-R10`, `-E19`, `-L100` and `L100-B`. At least it got the amount right... – GuiRitter Jul 03 '23 at 16:34
  • @InSync Of course, I have to apply that change to both sides... It's working now, thanks! Add it as an answer so I can accept it. – GuiRitter Jul 03 '23 at 16:40
  • Converted to an answer. – InSync Jul 03 '23 at 17:09

2 Answers

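The | operator has the lowest precedence in a regex, so without grouping each alternative extends to the ends of the whole pattern: yours is read as "N", or "letter, number, hyphen, B", or "letter, number, whitespace". To limit the branches separated by |, wrap them in a group. The group can be capturing (()) or non-capturing ((?:)), depending on what you need. With that, your regex should look like this: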

(?:
  (?<exitingN>N)
|
  ((?<exitingF>[UMTREL])(?<exitingNumber>[0-9]+))
)
-
(?:
  (?<enteringB>B)
|
  ((?<enteringF>[UMTREL])(?<enteringNumber>[0-9]+))
)

For an input of U0-M1, this regex matches and returns the following groups:

  • 0: U0-M1
  • 2: U0
  • exitingF: U
  • exitingNumber: 0
  • ...and so on.

Note that I removed the unnecessary {1}: an expression matches exactly one instance of itself by default.
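
To see the precedence issue in isolation, here is a minimal sketch on a cut-down pattern. The names ungrouped and grouped are made up for this illustration and are not part of the question's code.

```dart
// Without a group, '|' splits the whole pattern: either 'N' or 'U0-M1'.
final ungrouped = RegExp(r'N|U0-M1');

// With a group, the alternation only covers the part before the hyphen.
final grouped = RegExp(r'(?:N|U0)-M1');

void main() {
  print(ungrouped.firstMatch('N-M1')?.group(0)); // N
  print(grouped.firstMatch('N-M1')?.group(0)); // N-M1
}
```

The ungrouped pattern stops after N because its first alternative is already a complete match; the hyphen belongs only to the second alternative.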

Try it on regex101.com.
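
In Dart, the pattern has to be written without the layout whitespace shown above, since as far as I know RegExp has no free-spacing option. A minimal sketch, assuming you only need the named groups (so I dropped the unnamed capturing groups); transition and describeTransitions are names I made up for this example:

```dart
// Grouped pattern on one logical line, split across adjacent raw strings.
final transition = RegExp(
  r'(?:(?<exitingN>N)|(?<exitingF>[UMTREL])(?<exitingNumber>[0-9]+))'
  r'-'
  r'(?:(?<enteringB>B)|(?<enteringF>[UMTREL])(?<enteringNumber>[0-9]+))',
);

/// Returns one 'exiting -> entering' string per transition in [input].
List<String> describeTransitions(String input) {
  final result = <String>[];
  for (final m in transition.allMatches(input)) {
    // namedGroup returns null for a group that did not take part in the match.
    final exiting = m.namedGroup('exitingN') ??
        '${m.namedGroup('exitingF')}${m.namedGroup('exitingNumber')}';
    final entering = m.namedGroup('enteringB') ??
        '${m.namedGroup('enteringF')}${m.namedGroup('enteringNumber')}';
    result.add('$exiting -> $entering');
  }
  return result;
}

void main() {
  const input = 'N-U0 U0-M1\nM1-T9 T9-R10 R10-E19\nE19-L100 L100-B\n';
  describeTransitions(input).forEach(print);
}
```

Each match now covers a whole transition, so the exiting and entering parts come from the same match object and no joining by index is needed.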

InSync

If you have nothing against parsers, you can get the required result in about 10 minutes, that is, the time it takes to write a parser.
Parser code is also easier to understand and improve.

import 'package:parser_combinator/parser/digit.dart';
import 'package:parser_combinator/parser/many1.dart';
import 'package:parser_combinator/parser/predicate.dart';
import 'package:parser_combinator/parser/skip_while.dart';
import 'package:parser_combinator/parser/tag.dart';
import 'package:parser_combinator/parser/take_while_m_n.dart';
import 'package:parser_combinator/parser/terminated.dart';
import 'package:parser_combinator/parser/tuple.dart';
import 'package:parser_combinator/parsing.dart';

void main(List<String> args) {
  const part = Tuple2(TakeWhileMN(1, 1, isAlpha), Digit());
  const element = Tuple3(part, Tag('-'), part);
  const groups = Many1(Terminated(element, SkipWhile(isWhitespace)));
  final result = parseString(groups.parse, input)
      .map((e) => (entering: e.$1, exiting: e.$3))
      .toList();
  print(result.join('\n'));
  final element4 = result[3];
  print('$element4');
  print('${element4.entering}  ${element4.exiting}');
  print('${element4.entering.$1} ${element4.entering.$2}');
}

const input = '''N-U0 U0-M1
M1-T9 T9-R10 R10-E19
E19-L100 L100-B''';

Output:

(entering: (N, ), exiting: (U, 0))
(entering: (U, 0), exiting: (M, 1))
(entering: (M, 1), exiting: (T, 9))
(entering: (T, 9), exiting: (R, 10))
(entering: (R, 10), exiting: (E, 19))
(entering: (E, 19), exiting: (L, 100))
(entering: (L, 100), exiting: (B, ))
(entering: (T, 9), exiting: (R, 10))
(T, 9)  (R, 10)
T 9

In addition, it parses quite quickly: about 75,000 iterations per second on a fairly old computer.

void main(List<String> args) {
  const count = 100000;
  final sw = Stopwatch();
  sw.start();
  for (var i = 0; i < count; i++) {
    const part = Tuple2(TakeWhileMN(1, 1, isAlpha), Digit());
    const element = Tuple3(part, Tag('-'), part);
    const groups = Many1(Terminated(element, SkipWhile(isWhitespace)));
    final result = parseString(groups.parse, input);
  }

  sw.stop();
  print('Iterations: $count, time: ${sw.elapsedMilliseconds / 1000} sec');
}

Output:

Iterations: 100000, time: 1.305 sec
mezoni
  • Very interesting! There's no need to change the code that's already working but I'll save this for future reference. – GuiRitter Aug 04 '23 at 19:18