I was wondering if it was possible to call a subroutine but not capture the result of that call.

For instance, let's say I want to recursively match and capture a balanced bracket {} structure like

{dfsdf{sdfdf{ {dfsdf} }}dfsf}

I could use this regex:

(^(?'nest'\{(?>[^{}]|(?&nest))*\}))

the first group is what I want to capture.

However my definition of 'nest':

(?'nest' ... )

and my recursive call to the 'nest' subroutine:

(?&nest)

are also capturing groups. I would like to make my regex more efficient and save space by not capturing those groups. Is there any way to do this?

edit: I expect it's impossible to not capture a subroutine definition, since its pattern needs to be captured for use elsewhere.


edit2:

I'm testing this regex with boost::regex as well as Notepad++ regex. They actually appear to define different capturing groups, which is odd to me. I'm under the impression that they both use Perl regex by default.

Anyway, upon asking the question, I had the regex:

^\w+\s+[^\s]+\s+(?'header'(?'nest'\{(?>[^{}]|(?&nest))*\}))(?>\s+[^\s]+){5}\s+(?'data'(?>\{(?>[^{}]|(?&nest))*\}))\s+(?'class'(?>\{(?>[^{}]|(?&nest))*\}))

which I later realized contained needless characters that 'nest' already encapsulated. And I now have:

^\w+\s+[^\s]+\s+(?'nest'\{(?>[^{}]|(?&nest))*\})(?>\s+[^\s]+){5}\s+((?&nest))\s+((?&nest))

Notepad++ provides me with 3 capture groups when I do a replace statement

\\1: \1 \n \\2: \2 \n 3: \3 \n 4: \4

It tells me that "1 occurrence was replaced, next occurrence not found". The replacement has no text after the 4:, making me believe that the 4th capture group doesn't exist.

HOWEVER boost::regex_match returns an object with 6 positions:

0: metadata on the match

1: the entire match

2: the entire match

3: group1 from notepad++

4: group2 from notepad++

5: group3 from notepad++

I'm still trying to make sense of positions 1 and 2.


edit3

I misunderstood yet another piece of the puzzle...

boost::cmatch.m_subs[i] != boost::cmatch[i]

I thought that they were equal. After some more debugging, it turns out that indexing into the object works exactly like the documentation says. But I incorrectly assumed that the object would contain a structure that mirrored what boost::cmatch[i] returned. It appears that boost::cmatch[i] first removes all entries from m_subs that have matched == false. The remaining entries line up with what boost::cmatch[i] returns.

Derek
  • If you need to match that `{...}` balanced substring at the start of the string, no, there is no way to use a non-capturing subroutine call. – Wiktor Stribiżew Jun 15 '17 at 16:32
  • Is there any reason for this? It seems like that feature would enhance regex functionality. – Derek Jun 15 '17 at 16:40
  • I'd rather call it premature optimization. – Wiktor Stribiżew Jun 15 '17 at 16:41
  • Posted the way to do it. But I see you've accepted there is _no_ way to do it, ahh.. –  Jun 15 '17 at 17:03
  • @sln: See your own answer: *Capture Groups = 1*. There *must* be a capturing group for the regex engine to recurse it. Be it in the DEFINE block or in the consuming subpattern, it must be defined. It is still kept in the memory. OP problem is not the resulting match object structure, but whether it is possible to recurse a non-capturing group. Which is impossible. – Wiktor Stribiżew Jun 15 '17 at 17:15
  • @WiktorStribiżew - Apples and oranges. => https://regex101.com/r/aT4TlM/1 There is no capture group internally, just a _function_. `May only be used to define functions. No matching is done in this group.` –  Jun 15 '17 at 17:19
  • @sln, exactly, apples and oranges. As I mentioned, it is not the match object structure that OP is interested in, but the internals. The regex engine still keeps the DEFINEd group in memory to know what to recurse. – Wiktor Stribiżew Jun 15 '17 at 17:20
  • @WiktorStribiżew - As I said, it is _not_ a capture group inside a DEFINE construct. Calls are a jump point into regex state code, has nothing to do with captures. –  Jun 15 '17 at 17:21
  • `boost::regex_match` matches the _entire input string_. If you want to replace globally, use the `boost::regex_replace` function. Can you post the target string used ? –  Jun 15 '17 at 17:46
  • I wish I could, but I don't have the power to do so. But I believe this was a conceptual question anyway and am more than satisfied with the answers given without seeing the real text. – Derek Jun 15 '17 at 17:51

3 Answers


A subroutine call is a mechanism that recurses subpatterns. The regex engine must know what group to recurse, and that is why it requires either its ID (if the group is numbered) or name (if it is a named group, as in your case). Non-capturing groups DO NOT store references to these group patterns, and thus, you cannot reference them inside a subroutine call.

The only way to not use a capturing group in a subroutine call is to use a shortcut to the whole pattern, (?R). BUT it is not an option when you need to recurse only a part of the pattern (as in your case, where you want to match the start of the string and only recurse the part of the pattern after ^).
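
To make the recursion itself concrete, here is a minimal sketch in plain Python. Python's standard `re` module has no subroutine calls, so this is a hand-written recursive function rather than a regex, and the name `match_nest` is made up for the illustration. It follows the same shape as the 'nest' group: an opening brace, then any mix of non-brace characters and nested blocks, then a closing brace.

```python
def match_nest(text, pos=0):
    """Return the index just past a balanced {...} block that starts at pos,
    or None if no balanced block starts there."""
    if pos >= len(text) or text[pos] != "{":
        return None
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == "}":
            return i + 1                 # closing brace ends this block
        if ch == "{":
            end = match_nest(text, i)    # plays the role of (?&nest)
            if end is None:
                return None
            i = end
        else:
            i += 1                       # plays the role of [^{}]
    return None                          # input ended before the block closed


sample = "{dfsdf{sdfdf{ {dfsdf} }}dfsf}"
end = match_nest(sample)
print(sample[:end])
```

The nested calls only return a position. Nothing about the inner blocks is stored unless the caller chooses to keep it, which is roughly the distinction argued over in the comments between calling a subpattern and capturing what it matched.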

Wiktor Stribiżew
  • thank you for the quick reply. I extended my question to reflect that I had a deeper misunderstanding. – Derek Jun 15 '17 at 16:47
  • @Derek As you see, even in a DEFINE block a capturing group is used to define a subpattern. The only difference is that that capture is not returned in the match results. **You can't avoid defining a pattern as a capturing group if you need to recurse it**, you can only avoid having it inside a match result. – Wiktor Stribiżew Jun 15 '17 at 18:37

Any subroutine placed into a (?(DEFINE)...) construct won't capture anything.

If you just want to avoid having any captures, it's done like this

https://regex101.com/r/aT4TlM/1

Note the -

Subpattern definition construct (?(DEFINE)(?'nest'\{(?>[^{}]|(?&nest))*\}))
May only be used to define functions. No matching is done in this group.

^(?&nest)(?(DEFINE)(?'nest'\{(?>[^{}]|(?&nest))*\}))

And since you have that BOS anchor there ^ it's the only way.
I.e. (?R) is not an option.

Expanded

 ^ 
 (?&nest) 

 (?(DEFINE)

      (?'nest'                      # (1 start)
           \{
           (?>
                [^{}] 
             |  (?&nest) 
           )*
           \}
      )                             # (1 end)
 )

Output

  **  Grp 0        -  ( pos 0 , len 29 ) 
 {dfsdf{sdfdf{ {dfsdf} }}dfsf}  
  **  Grp 1 [nest] -  NULL 
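
The `NULL` for group 1 means the group is defined in the pattern but took no part in the match. Python's standard `re` module supports neither `(?(DEFINE)...)` nor `(?&name)`, so the following is only an analogy, with the hypothetical group names `left` and `right`. It shows the same effect: a group that is counted in the compiled pattern but unset in the result.

```python
import re

# Two capture groups are compiled, but only one alternative can take part
# in any given match.
pattern = re.compile(r"(?P<left>\{)|(?P<right>\})")
m = pattern.match("}")

print(pattern.groups)     # number of capture groups defined in the pattern
print(m.group("left"))    # group that did not participate in the match
print(m.group("right"))   # group that did participate
```

Here `pattern.groups` is 2, while `m.group("left")` is `None`, much like the `Grp 1 [nest] - NULL` line in the output above.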

Metrics

----------------------------------
 * Format Metrics
----------------------------------
Atomic Groups       =   1

Capture Groups      =   1
       Named        =   1

Recursions          =   2

Conditionals        =   1
       DEFINE       =   1

Character Classes   =   1

Re: Edit2

This regex

^\w+\s+[^\s]+\s+(?'nest'\{(?>[^{}]|(?&nest))*\})(?>\s+[^\s]+){5}\s+((?&nest))\s+((?&nest))

contains only 3 capture groups, which is easy to see once it is formatted:

 ^ \w+ \s+ [^\s]+ \s+ 
 (?'nest'                      # (1 start)
      \{
      (?>
           [^{}] 
        |  (?&nest)
      )*
      \}
 )                             # (1 end)
 (?> \s+ [^\s]+ ){5}
 \s+ 
 ( (?&nest) )                  # (2)
 \s+ 
 ( (?&nest) )                  # (3)

What is it you want to do with this ?

  • It works EXACTLY as intended... my original regex was incorrect, but I assumed that I had written it correctly and that improper grouping was a symptom of the underlying software, not my regex. This was compounded by my misinterpretation of the boost::cmatch object as it appears while debugging in VS2013. – Derek Jun 15 '17 at 17:48
  • Yeah, it looks like from the docs `Objects of type sub_match may only be obtained by subscripting an object of type match_results.` the sub_match, obtained from match_results[I], contain methods to get stuff about the group. Like iterators first/last ( for string(m[2].first,m[2].second) ), bools like matched, and other direct string conversion like m[2].basic_string(). And also position and length. –  Jun 15 '17 at 18:08
  • When I use sub_match, sometimes I go deeper, down to this level `(int)_m[ i ].first._Ptr` to directly compare to other lengths and locations, or specific unrelated offsets. –  Jun 15 '17 at 18:14