
Note: I'm using the PyPI regex module.

I have the following regex pattern (flags V1 + VERBOSE):

(?(DEFINE)
  (?P<id>[\d-]+)
)
id:\s(?&id)(,\s(?&id))*

How can I retrieve every substring that the <id> group matched?

For example, in the following text:

don't match this date: 2020-10-22 but match this id: 5668-235 as well as these id: 7788-58-2, 8688-25, 74-44558

I should be able to retrieve the following values:

["5668-235", "7788-58-2", "8688-25", "74-44558"]

Note that this regex already matches the right text, but I would like to retrieve every occurrence that a specific group matched (even if it matched multiple times within the same match object).

François
  • Wrap it with a capturing group. – Wiktor Stribiżew Oct 22 '20 at 13:28
  • Even with capturing groups, in the case where the `<id>` pattern is repeated, like the last one, the middle match is not returned, as in this example: https://regex101.com/r/fDcvJF/3 – François Oct 22 '20 at 13:39
  • Do not look at regex101 *results*, it does not support PyPi regex library, see my answer. Especially [this demo](https://tio.run/##ZZBPT8MwDMXv@RRWDzRhzVSypq0qWCVEkbgMBBKXbYeyZDTS@keJJ8qnL9nGBXGy9X7W87OHb2z6bjFNph16i2D1px7JUCPcgQ3DkJb0oXp8WlWMANDy5dao5Xqj@HbGCCNGFRs3u8jtkpZXRjFGyyL6r7Jr70ZQj2fnQPVdiNDWuGsAG@NA1agLELGI@U3MhYCP4x/uV4FM05yLhYTawZc@HE4VG@30mWZZnnPpByLIU98KGUGW8CSRMg/IYE2HFNbjfFcPeLTa0cDHCxjsewsjmO5y@3xvOmVQW@qfEMEpcPRL3qvX@@e3im2BTdMP). – Wiktor Stribiżew Oct 22 '20 at 13:42

1 Answer

Named capturing groups declared inside a DEFINE block serve only as building blocks for the rest of the pattern. When they are invoked with (?&id) in the consuming part of the pattern, they do not capture the text they match.

In this particular case, you can use

(?(DEFINE)
  (?P<id>[\d-]+)
)
id:\s+(?P<idm>(?&id))(?:,\s+(?P<idm>(?&id)))*

See this regex demo. The point is to wrap each (?&id) call in an additional named capturing group. I named it idm, but you may use any name.

See the Python demo:

import regex
pat = r'''(?(DEFINE)
  (?P<id>[\d-]+)
)
id:\s+(?P<idm>(?&id))(?:,\s+(?P<idm>(?&id)))*'''
text = r"don't match this date: 2020-10-22 but match this id: 5668-235 as well as these id: 7788-58-2, 8688-25, 74-44558"
print( [x.captures("idm") for x in regex.finditer(pat, text, regex.VERBOSE)] )
# => [['5668-235'], ['7788-58-2', '8688-25', '74-44558']]
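The question asks for a single flat list, while captures("idm") yields one list per match. A minimal sketch of flattening the result, assuming the same pattern and the PyPI regex module; the helper name find_ids is hypothetical and only used for illustration:

```python
import regex  # third-party module: pip install regex

# Same pattern as above: each (?&id) call is wrapped in a group named idm.
# The regex module allows the group name to be reused within one pattern.
PATTERN = regex.compile(r'''
(?(DEFINE)
  (?P<id>[\d-]+)
)
id:\s+(?P<idm>(?&id))(?:,\s+(?P<idm>(?&id)))*
''', regex.VERBOSE)


def find_ids(text):
    """Return every idm capture from every match as one flat list."""
    return [capture
            for match in PATTERN.finditer(text)
            for capture in match.captures("idm")]


text = ("don't match this date: 2020-10-22 but match this id: 5668-235 "
        "as well as these id: 7788-58-2, 8688-25, 74-44558")
ids = find_ids(text)
print(ids)
```

The nested comprehension walks the matches in order and then the captures within each match, so the ids come out in the order they appear in the text.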
Wiktor Stribiżew