Why does this regex capture a maximum of 2 capture groups and not all those within the input string?

Question

import re

def verify_need_to_restructure_where_capsule(m):
    capture_where_capsule = str(m.group(1))
    print(capture_where_capsule)
    return capture_where_capsule


input_text = "Rosa está esperándote ((PL_ADVB='saassa')abajo). Estábamos ((PL_ADVB='la casa cuando comenzó el temporal')dentro). Los libros que buscas están ((PL_ADVB='en la estantería de allí arriba')arriba); Conociéndole, quizás ya tenga las cosas preparadas ((PL_ADVB='mente mesa principal, ademas la hemos arreglado')sobre)"


list_all_adverbs_of_place = ["adentro", "dentro", "arriba de", "arriba", "al medio", "abajo", "hacía", "hacia", "por sobre", "sobre las","sobre la", "sobre el", "sobre"]
place_reference = r"(?i:\w\s*)+"
pattern = re.compile(r"(\(\(PL_ADVB='" + place_reference + r"'\)" + rf"({'|'.join(list_all_adverbs_of_place)})" + r"\))", re.IGNORECASE)

input_text = re.sub(pattern, verify_need_to_restructure_where_capsule, input_text, re.IGNORECASE)

Whatever input_text I try, it captures at most the first 2 matches, not all the occurrences that actually exist:

((PL_ADVB='saassa')abajo)
((PL_ADVB='la casa cuando comenzó el temporal')dentro)

This is the expected output, with all occurrences identified rather than just the first 2:

((PL_ADVB='saassa')abajo)
((PL_ADVB='la casa cuando comenzó el temporal')dentro)
((PL_ADVB='en la estantería de allí arriba')arriba)
((PL_ADVB='mente mesa principal, ademas la hemos arreglado')sobre)

It's quite curious: if I change the order of the matching substrings within the string, the pattern still detects them, but always only the first 2. It is as if re.sub() had been given the parameter that limits the number of replacements (here, to 2), yet I am not passing that parameter, and re.sub() still stops after a limited number of replacements.

EDIT (with findall):

import re

def verify_need_to_restructure_where_capsule(m):
    capture_where_capsule = str(m.group(1))
    print(capture_where_capsule)
    return capture_where_capsule


input_text = "Rosa está esperándote ((PL_ADVB='saassa')abajo). Estábamos ((PL_ADVB='la casa cuando comenzó el temporal')dentro). Los libros que buscas están ((PL_ADVB='en la estantería de allí arriba')arriba); Conociéndole, quizás ya tenga las cosas preparadas ((PL_ADVB='mente mesa principal, ademas la hemos arreglado')sobre)"


list_all_adverbs_of_place = ["adentro", "dentro", "arriba de", "arriba", "al medio", "abajo", "hacía", "hacia", "por sobre", "sobre las","sobre la", "sobre el", "sobre"]
place_reference = r"(?i:\w\s*)+"
pattern = re.compile(r"(\(\(PL_ADVB='" + place_reference + r"'\)" + rf"({'|'.join(list_all_adverbs_of_place)})" + r"\))", re.IGNORECASE)

print(re.findall(pattern, input_text))

input_text = re.sub(pattern, verify_need_to_restructure_where_capsule, input_text, re.IGNORECASE)

Use `re.findall()` to see what it's matching. I get 3 matches, not 2. — Barmar, Jan 24 '23 at 20:11
It doesn't match `((PL_ADVB='mente mesa principal, ademas la hemos arreglado')sobre)` because `sobre` isn't in `list_all_adverbs_of_place`. It has `por sobre`, `sobre las`, `sobre la` and `sobre el`, but not `sobre` by itself. — Barmar, Jan 24 '23 at 20:14
When I asked the question I forgot to put the element `"sobre"` in the list. I have now updated it, and you can see that it still isn't captured. Using `print(re.findall(pattern, input_text))` gives me only the first 2 matches `[("((PL_ADVB='saassa')abajo)", 'abajo'), ("((PL_ADVB='la casa cuando comenzó el temporal')dentro)", 'dentro'), ("((PL_ADVB='en la estantería de allí arriba')arriba)", 'arriba')]` — Matt095, Jan 24 '23 at 20:17
@Barmar I have noticed that it is limited to 2 matches whether or not the element `"sobre"` is in the list. With the code I have now updated, you can reproduce the problem with the `re.sub()` method. — Matt095, Jan 24 '23 at 20:21
`place_reference` only matches alphanumeric and whitespace characters after `PL_ADVB`. The missing matches have commas in those strings. — Barmar, Jan 24 '23 at 20:25
Where is this data coming from? It seems like it's formatted from some original data, wouldn't it be easier to work with the original data rather than parsing the formatted result? — Barmar, Jan 24 '23 at 20:28
I thought the same thing, but the string `((PL_ADVB='en la estantería de allí arriba')arriba)` has no commas or punctuation marks in between, and it still isn't captured. — Matt095, Jan 24 '23 at 20:28
I get that in my results when I do `pattern.findall(input_text)`. I don't know why you don't. — Barmar, Jan 24 '23 at 20:29
I changed my link to use your code and it still prints 3 items. The only difference is that I originally used `pprint()` to make the output easier to read. — Barmar, Jan 24 '23 at 20:36
@Barmar Using `.findall()` captures `[("((PL_ADVB='saassa')abajo)", 'abajo'), ("((PL_ADVB='la casa cuando comenzó el temporal')dentro)", 'dentro'), ("((PL_ADVB='en la estantería de allí arriba')arriba)", 'arriba')]` but using re.sub() only captures 2 of them — Matt095, Jan 24 '23 at 20:37
Check the documentation of `re.sub()`. The 4th positional argument is `count`, the limit of replacements to make. You're passing `re.IGNORECASE`, whose value is `2`, so it only does 2 replacements. — Barmar, Jan 24 '23 at 20:40
You don't need that argument, since you specified the flag in `re.compile()`. But when you do need to pass it, use a named argument: `flags = re.IGNORECASE` — Barmar, Jan 24 '23 at 20:42
Thank you very much, you are right. With the argument passed by keyword as `flags`, it now captures 3 of the matches; the only one it still misses is the one containing a punctuation mark (such as `,`, `;` or `.`). Could I still use the pattern `place_reference = r"(?i:\w\s*)+"` and make it work for all 4 matches? Or does it need to be extended beyond alphanumeric characters to include a few more symbols? I don't want to change the pattern so much that it affects the other 3 matches, so could I just add those specific symbols? — Matt095, Jan 24 '23 at 20:50
`place_reference = r'(?:[\w,]\s*)+'` to match commas. What's the `i` for? — Barmar, Jan 24 '23 at 20:53
It worked for me, even after extending it to periods and semicolons: `place_reference = r'(?:[\w,;.]\s*)+'`. I also tried removing `re.IGNORECASE`, since I thought it was what was limiting the number of replacements, although now I think that is no longer necessary. — Matt095, Jan 24 '23 at 21:04
If you want to match everything up to the closing `'`, you can use `[^']` instead of listing the characters to allow. — Barmar, Jan 24 '23 at 21:09
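
The `count` versus `flags` mix-up described in the comments can be reproduced in isolation. The following is a minimal sketch on a made-up string (the variable names `text`, `limited` and `unlimited` are illustrative and not from the question). Newer Python versions may emit a DeprecationWarning for the positional form, so the sketch silences it.

```python
import re
import warnings

# Newer Python versions may warn when `count` is passed positionally;
# silenced here so the demo runs quietly.
warnings.simplefilter("ignore", DeprecationWarning)

text = "a A a A a A"

# re.IGNORECASE is an integer flag whose value is 2.
print(int(re.IGNORECASE))

# Passed positionally, the flag lands in the `count` slot of re.sub():
# at most 2 replacements are made, and matching stays case-sensitive.
limited = re.sub(r"a", "x", text, re.IGNORECASE)

# Passed by keyword, it is treated as a flag and every match is replaced.
unlimited = re.sub(r"a", "x", text, flags=re.IGNORECASE)

print(limited)
print(unlimited)
```

With a pattern already compiled with `re.IGNORECASE`, as in the question, the fourth argument can simply be dropped.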

score 1 · Accepted Answer · answered Jan 25 '23 at 05:44

You can capture the adverb that follows the quoted part with a capture group and then check whether it is present in the list.

import re

input_text = "Rosa está esperándote ((PL_ADVB='saassa')abajo). Estábamos ((PL_ADVB='la casa cuando comenzó el temporal')dentro). Los libros que buscas están ((PL_ADVB='en la estantería de allí arriba')arriba); Conociéndole, quizás ya tenga las cosas preparadas ((PL_ADVB='mente mesa principal, ademas la hemos arreglado')sobre)"

list_all_adverbs_of_place = ["adentro", "dentro", "arriba de", "arriba", "al medio", "abajo", "hacía", "hacia",
                             "por sobre", "sobre las", "sobre la", "sobre el", "sobre"]

regex_pattern = r'\(\(\w+\=\W(?:\w+\W?\s?)+\)(\w+)\)'
matches = []
data = re.finditer(regex_pattern, input_text)
for i in data:
    if i.group(1) in list_all_adverbs_of_place:
        matches.append(i.group())
print(matches)

>>> ["((PL_ADVB='saassa')abajo)", "((PL_ADVB='la casa cuando comenzó el temporal')dentro)", "((PL_ADVB='en la estantería de allí arriba')arriba)", "((PL_ADVB='mente mesa principal, ademas la hemos arreglado')sobre)"]
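
An alternative that stays closer to the original pattern is sketched below, combining the suggestions from the comments: `[^']+` for the quoted part and the adverb list as an alternation. This is only one possible arrangement, not the asker's or the answerer's final code. Sorting the adverbs longest first and passing them through `re.escape()` are assumptions added here for safety, and the names `adverbs` and `alternation` are illustrative. Unlike `(\w+)` above, which captures a single word, the alternation also allows multi-word entries such as `arriba de`.

```python
import re

input_text = "Rosa está esperándote ((PL_ADVB='saassa')abajo). Estábamos ((PL_ADVB='la casa cuando comenzó el temporal')dentro). Los libros que buscas están ((PL_ADVB='en la estantería de allí arriba')arriba); Conociéndole, quizás ya tenga las cosas preparadas ((PL_ADVB='mente mesa principal, ademas la hemos arreglado')sobre)"

adverbs = ["adentro", "dentro", "arriba de", "arriba", "al medio", "abajo", "hacía", "hacia",
           "por sobre", "sobre las", "sobre la", "sobre el", "sobre"]

# Longest alternatives first, so "arriba de" is tried before "arriba".
alternation = "|".join(re.escape(a) for a in sorted(adverbs, key=len, reverse=True))

# [^']+ accepts anything up to the closing quote, including commas and periods.
pattern = re.compile(r"\(\(PL_ADVB='[^']+'\)(?:" + alternation + r")\)", re.IGNORECASE)

# No capturing groups in the pattern, so findall() returns the full matches.
matches = pattern.findall(input_text)
for m in matches:
    print(m)
```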
