
Let m be of type std::smatch. Suppose there is an unmatched group i. What is m.position(i)? For that matter, what is m[i]?

For example, consider

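#include <iostream>
#include <regex>
#include <string>
int main() {
    std::regex re{"^(a+)|(b+)"};   // group 1 is (a+), group 2 is (b+)
    std::string target = "aa";     // only the first alternative can match
    std::smatch m;
    std::regex_search(target, m, re);
    std::cout << "m[2] is: " << m[2] << " at position: " << m.position(2);
}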

I cannot figure out from the reference https://en.cppreference.com/w/cpp/regex/match_results/position what is guaranteed to happen here and why.

kdog
  • It would be faster to write a test than to wait for someone who absolutely knows. – Joseph Larson Feb 02 '21 at 20:35
  • What do you get when you run that code? – Barmar Feb 02 '21 at 20:35
  • @Barmar What possible difference would running the code make? I do not care what a particular set of compilers does; I care here what the standard requires. – kdog Feb 02 '21 at 21:01
  • How the regex should behave depends on the arguments you pass to the constructor for `std::regex`. In this case you're leaning on the default, which is ECMAScript, for which you can find some reference [here](https://en.cppreference.com/w/cpp/regex/ecmascript). In a nutshell, you have 3 groups: the overall match, the (a+) and (b+) – AndyG Feb 02 '21 at 21:01
  • @AndyG exactly, and I am asking what the standard requires the position of the (b+) match to be, since it fails to match. – kdog Feb 02 '21 at 21:02
  • @AndyG His point is that `(b+)` doesn't match anything, so what do all the functions that reference submatch 2 do? – Barmar Feb 02 '21 at 21:03
  • I think you're supposed to use `m[2].matched` to tell if it matched anything. – Barmar Feb 02 '21 at 21:08
  • And I'm saying that the behavior will depend on the (optional) flags used to construct the `std::regex`. For example, [nosubs](https://en.cppreference.com/w/cpp/regex/syntax_option_type) will always result in no sub-matches being stored in the `smatch` – AndyG Feb 02 '21 at 21:08
  • @AndyG use the flags in the example please. – kdog Feb 02 '21 at 21:10
  • @Barmar thanks that is probably correct. But what happens if I don't check `.matched`? – kdog Feb 02 '21 at 21:11
  • @kdog: Given your exact code and input ("aa"), it's clear the 3rd group is not matched, so it will be an empty string. – AndyG Feb 02 '21 at 21:13
  • It doesn't seem to be specified. I'd expect an invalid index like `-1`. – Barmar Feb 02 '21 at 21:13
  • @AndyG `position` isn't a string, it's an index. – Barmar Feb 02 '21 at 21:13
  • @Barmar: OP should check `m[2].empty()` – AndyG Feb 02 '21 at 21:15
  • @AndyG I think you mean `m[2].str().empty()` – Barmar Feb 02 '21 at 21:16
  • But you can also get an empty string when you match an empty substring, such as when using a quantifier. E.g. with `(a*)(b*)` you'll get an empty string with the position being `2`. – Barmar Feb 02 '21 at 21:17
  • @JosephLarson: Results from tests give hints, not guarantees. C++ has too much UB and unspecified/implementation-specific behavior to rely only on tests. – Jarod42 Feb 03 '21 at 00:42

1 Answer


According to the C++17 Standard:

28.10 Class template match_results [ re.results ]

4 The sub_match object stored at index 0 represents sub-expression 0, i.e., the whole match. In this case the sub_match member matched is always true. The sub_match object stored at index n denotes what matched the marked sub-expression n within the matched expression. If the sub-expression n participated in a regular expression match then the sub_match member matched evaluates to true, and members first and second denote the range of characters [first,second) which formed that match. Otherwise matched is false, and members first and second point to the end of the sequence that was searched.

[ Note: The sub_match objects representing different sub-expressions that did not participate in a regular expression match need not be distinct. — end note ]

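Now m.position(n) returns std::distance(prefix().first, (*this)[n].first), that is, the offset of (*this)[n].first from the start of the target sequence.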

Given that "[If] matched is false, [then] members first and second point to the end of the sequence that was searched" ...

This means m.position(n) should equal the offset of "the end of the sequence that was searched", which for a search over a whole string is the length of the target (2 for "aa").

Cody Gray - on strike
Galik
  • That looks like a great answer and exactly what I was looking for. I do have a question: why wasn't this, or is it, somewhere convenient in cppreference? Should I not be using cppreference and instead only use the standard? I've been cheating by using cppreference all my life. – kdog Feb 02 '21 at 21:23
  • @kdog I find *cppreference* is much easier to search and understand than the standard. I would probably code at the pace of a snail if I went to the Standard for everything - especially basic API stuff. This is a little obscure as not many people are likely to need to know this information. Maybe that's why *cppreference* overlooked it? It is usually pretty accurate but no cooperatively edited website is going to be 100% all the time. You pay your money and take your chance! – Galik Feb 02 '21 at 21:28
  • The reason cppreference is easier to read is precisely because it doesn't go into as much detail as the standard needs to. – Barmar Feb 03 '21 at 22:26