I am trying to understand Non-capturing groups in Regex.

If I have the following input:

He hit the ball.  Then he ran.  The crowd was cheering!  How did he feel?  I felt so energized!

To extract the first word of each sentence, I tried the match pattern:

^(\w+\b.*?)|[\.!\?]\s+(\w+)

That puts the desired output in the submatch.

Match   $1
He      He  
. Then  Then
. The   The
! How   How
? I     I

But I was thinking that using non-capturing groups, I should be able to get them back in the match.

I tried:

^(?:\w+\b.*?)|(?:[\.!\?]\s+)(\w+)

and that yielded:

Match   $1
He  
. Then  Then
. The   The
! How   How
? I     I

and ^(?:\w+\b.*?)|(?:[.!\?]\s+)\w+

yielded:

Match
He
. Then
. The
! How
? I

What am I missing?

(I am testing my regex using RegExLib.com, but will then transfer it to VBA).

MonroeGA
  • Simple question: do you know what groups are and why we need them? – VladL Jan 09 '13 at 19:01
  • Non-capturing group means that it will not store the text matched by the pattern in the group. It doesn't mean that the text is not matched by the whole regex. You will need zero-width look-around if you don't want the match result of the whole regex to contain the parts you don't need. The trick may not work all the time, so using a group as you have been doing is an acceptable solution. – nhahtdh Jan 09 '13 at 19:40
  • @MonroeGA Please accept an answer, thanks. – Madbreaks Jan 26 '17 at 18:20
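
To make nhahtdh's point concrete, here is a minimal VBA sketch. The Sub name and the late-bound `CreateObject("VBScript.RegExp")` call are my own assumptions for illustration, not something taken from the question. It shows that text matched by `(?:...)` is still part of the overall match and is only left out of the submatches.

```vba
' Sketch: a non-capturing group still contributes to the overall match text.
' Late binding (CreateObject) is assumed so that no library reference is needed.
Sub DemoNonCapturingGroup()
    Dim re As Object, m As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "(?:[.!?]\s+)(\w+)"

    For Each m In re.Execute("He hit the ball.  Then he ran.  The crowd was cheering!")
        ' m.Value is the whole match, including the text matched by (?:...)
        ' m.SubMatches(0) is only the text captured by (\w+)
        Debug.Print "[" & m.Value & "]", "[" & m.SubMatches(0) & "]"
    Next m
End Sub
```

On this sample the match values still begin with the punctuation and spaces (for example `.  Then`), while the submatch holds just `Then`. As far as I know, VBScript.RegExp supports look-ahead but not look-behind, so in VBA reading the submatch is usually the practical way to drop the leading delimiter.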

3 Answers

A simple example against string "foo":

(f)(o+)

Will yield $1 = 'f' and $2 = 'oo';

(?:f)(o+)

Here, $1 = 'oo' because the first group is explicitly non-capturing, so (o+) becomes the first and only capturing group. There is no $2.
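
A rough VBA sketch of the same comparison, in case it helps to see it run. The Sub name and the late-bound `CreateObject` call are assumptions on my part. Note that `SubMatches` is zero-based, so `SubMatches(0)` corresponds to $1.

```vba
' Sketch comparing group numbering with and without (?:...).
Sub DemoGroupNumbering()
    Dim re As Object, m As Object
    Set re = CreateObject("VBScript.RegExp")

    ' Two capturing groups: SubMatches(0) is $1, SubMatches(1) is $2
    re.Pattern = "(f)(o+)"
    Set m = re.Execute("foo").Item(0)
    Debug.Print m.SubMatches.Count & " " & m.SubMatches(0) & " " & m.SubMatches(1)  ' 2 f oo

    ' First group is non-capturing, so (o+) becomes $1
    re.Pattern = "(?:f)(o+)"
    Set m = re.Execute("foo").Item(0)
    Debug.Print m.SubMatches.Count & " " & m.SubMatches(0)                          ' 1 oo
    Debug.Print m.Value                                                             ' foo
End Sub
```

The last line is the part that is easy to miss: the overall match is still `foo` in both cases, because `(?:f)` matches the `f` even though it does not store it.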

For your scenario, this feels about right:

(?:(\w+).*?[\.\?!] {2}?)

Note that the outermost group is a non-capturing group, while the inner group (the first word of the sentence) is capturing.

Madbreaks
  • Thanks for your help. @Madbreaks However, if I put your expression into the tester, I don't get back the last "I". I get only 4 results. Otherwise, the results are the same as my original expression, the desired items come back in the submatch. (Another difference is that the entire string comes back in the match). If I change your expression to '(?:(\w+).*?[\.\?!]\s+?)' then I get all 5 starting words in the submatch. – MonroeGA Jan 10 '13 at 00:44
  • That `\s` will match a space, but so should a literal space character as in my example. Otherwise the expressions are the same. Are you sure you included the literal space in my example, preceding `{2}`? – Madbreaks Jan 10 '13 at 00:48
  • Yes, I had left a space (cut and paste from above to be double sure). I got the same results in the tester and in VBA. But thanks @Madbreaks, I appreciate your help! – MonroeGA Jan 10 '13 at 00:52
  • Actually, `(\w+).*?[\.\?!]\s` seems to yield the same results as `(?:(\w+).*?[\.\?!]\s)`. – MonroeGA Jan 10 '13 at 00:58
  • Same results, *but you aren't creating any unnecessary matching groups*. The `(?:)` syntax means "group, but don't create a back reference". If you omit it, you create a back reference you don't need. Glad I could help; if you're able to, please consider upvoting my answer (and others you found helpful too). Cheers – Madbreaks Jan 10 '13 at 01:41
  • Hey @Madbreaks, the system says that since I am new to the site and don't have enough reputation points, I am not allowed to upvote any answers. But thank you for your help, sorry I couldn't officially recognize it on the site. – MonroeGA Jan 11 '13 at 15:50
  • @MonroeGA You *should* still be able to accept the answer as correct by clicking the checkbox to the upper-left of my answer. Thanks – Madbreaks Jan 11 '13 at 17:36

The following constructs a non-capturing group for the boundary condition, and captures the word after it with a capturing group.

(?:^|[.?!]\s*)(\w+)

It's not clear from your question how you are applying the regex to the text, but your regular "pull out another until there are no more matches" loop should work.
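
Since the question mentions VBA, such a loop might look roughly like the sketch below. The Sub name and the late-bound `CreateObject("VBScript.RegExp")` call are assumptions for illustration only.

```vba
' Sketch: iterate over every match and read the capturing group from each.
Sub DemoFirstWords()
    Dim re As Object, m As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True               ' without this, Execute returns only the first match
    re.Pattern = "(?:^|[.?!]\s*)(\w+)"

    For Each m In re.Execute("He hit the ball.  Then he ran.  The crowd was cheering!  " & _
                             "How did he feel?  I felt so energized!")
        Debug.Print m.SubMatches(0)   ' the word captured by (\w+)
    Next m
End Sub
```

On the sample text this prints He, Then, The, How and I, one per line. The final `!` produces no match because no word follows it.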

tripleee
  • I had been testing in a simple tool, trying to better understand regex constructs overall. My goal was to create a generic method in Excel VBA to run regex queries. My problem was deciding whether I needed to pull from the match result value or the subitems. – MonroeGA Jan 10 '13 at 21:59
  • `results(i) = allMatches.Item(i).Value` or from `results(k) = allMatches.Item(i).submatches.Item(j)` I was trying to see if I could generally construct my regex match strings in such a way to force the results to always be in either the match or the subitem fields, such that I could have a generic method in VBA. Thanks @tripleee – MonroeGA Jan 10 '13 at 22:06
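
One possible shape for such a generic method is sketched below. `RegexResults` is a hypothetical name and the "prefer the first capture group" rule is just one design choice, not something settled in this thread.

```vba
' Hypothetical helper (name and behaviour are illustrative assumptions):
' returns one result per match, preferring the first capture group when
' the pattern defines one and falling back to the whole match otherwise.
Function RegexResults(ByVal sText As String, ByVal sPattern As String) As Collection
    Dim re As Object, m As Object
    Dim results As New Collection

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = sPattern

    For Each m In re.Execute(sText)
        If m.SubMatches.Count > 0 Then
            ' May be empty if the group did not take part in this match
            results.Add m.SubMatches(0)
        Else
            results.Add m.Value
        End If
    Next m

    Set RegexResults = results
End Function
```

With this, a pattern written with a capture group, such as `(?:^|[.?!]\s*)(\w+)`, and one written without any group, such as `[A-Z]\w*`, can both be passed to the same function, and the caller does not need to know whether to read the match or the submatch.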

This works and is simple:

([A-Z])\w*

VBA requires these flag settings:

Global = True 'Match all occurrences not just first
IgnoreCase = False 'First word of each sentence starts with a capital letter

Here's some additional hard-earned info: since your regex has at least one set of parentheses, you can use SubMatches to pull out only the values captured inside the parentheses and ignore the rest, which is very useful. Here is the debug output of a function I use to get SubMatches, run on your string:

theMatches.Count=5
Match='He'
   Submatch Count=1
   Submatch='H'
Match='Then'
   Submatch Count=1
   Submatch='T'
Match='The'
   Submatch Count=1
   Submatch='T'
Match='How'
   Submatch Count=1
   Submatch='H'
Match='I'
   Submatch Count=1
   Submatch='I'

T

Here's the call to my function that returned the above:

sText = "He hit the ball.  Then he ran.  The crowd was cheering!  How did he feel?  I felt so energized!"
sRegEx = "([A-Z])\w*"
Debug.Print ExecuteRegexCapture(sText, sRegEx, 2, 0) '3rd match, 1st Submatch

And here's the function:

'Returns Submatch specified by the passed zero-based indices:
'iMatch is which match you want,
'iSubmatch is the index within the match of the parenthesis
'containing the desired results.
Function ExecuteRegexCapture(sStringToSearch, sRegEx, iMatch, iSubmatch)
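   'New RegExp requires a reference to "Microsoft VBScript Regular Expressions 5.5"
   '(Tools > References); alternatively use CreateObject("VBScript.RegExp")
   Dim oRegex As Object
   Dim theMatches As Object
   Dim i As Long, j As Long
   Dim bDebug As Boolean

   Set oRegex = New RegExp
   oRegex.Pattern = sRegEx
   oRegex.Global = True 'True = find all matches, not just first
   oRegex.IgnoreCase = False
   oRegex.Multiline = True 'True = ^ and $ also match at the start and end of each line
   bDebug = True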

   ExecuteRegexCapture = ""

   Set theMatches = oRegex.Execute(sStringToSearch)
   If bDebug Then Debug.Print "theMatches.Count=" & theMatches.Count

   For i = 0 To theMatches.Count - 1
      If bDebug Then Debug.Print "Match='" & theMatches(i) & "'"
      If bDebug Then Debug.Print "   Submatch Count=" & theMatches(i).SubMatches.Count
      For j = 0 To theMatches(i).SubMatches.Count - 1
         If bDebug Then Debug.Print "   Submatch='" & theMatches(i).SubMatches(j) & "'"
      Next j
   Next i

   If bDebug Then Debug.Print ""

   If iMatch < theMatches.Count Then
      If iSubmatch < theMatches(iMatch).SubMatches.Count Then
         ExecuteRegexCapture = theMatches(iMatch).SubMatches(iSubmatch)
      End If
   End If
End Function