Python re.finditer with lookahead and capturing group

Question

I am trying to use re.finditer() to get match objects for overlapping matches by combining a lookahead with a capturing group. Can someone explain how this actually works?

The following code:

import re

# Create a string with letters
target_string = 'GGGAA'

# Create a pattern using lookahead and capturing group
# to find all overlapping matches in target_string
str_pattern = r'(?=(GG))'

# Use re.finditer to extract all overlapping matches
result = re.finditer(str_pattern, target_string)

# For each match object in result print out
# groups() group(0)/group(1)
# span group(0)/group(1)
# start group(0)/group(1)
# end group(0)/group(1)
for i, match_obj in enumerate(result, 1):
    print('MO:', i)
    print('Groups:', match_obj.groups())
    print('Group 0:', match_obj.group(0))
    print('Group 1:', match_obj.group(1))
    print('Span group 0:', match_obj.span(0))
    print('Span group 1:', match_obj.span(1))
    print('Start Group 0:', match_obj.start(0))
    print('Start Group 1:', match_obj.start(1))
    print('End Group 0:', match_obj.end(0))
    print('End Group 1:', match_obj.end(1))
    print()

Gives the following results:

MO: 1
Groups: ('GG',)
Group 0:
Group 1: GG
Span group 0: (0, 0)
Span group 1: (0, 2)
Start Group 0: 0
Start Group 1: 0
End Group 0: 0
End Group 1: 2

MO: 2
Groups: ('GG',)
Group 0:
Group 1: GG
Span group 0: (1, 1)
Span group 1: (1, 3)
Start Group 0: 1
Start Group 1: 1
End Group 0: 1
End Group 1: 3

The code should generate two matches, and it does: MO 1 and MO 2.

groups() gives a tuple with the matched substring and an empty (?) value - why?

group(0) gives an empty (?) value - why?

group(1) gives the captured group - why on position 1 and not 0?

span(0) gives a range of length 1 - why?

span(1) gives a range of length 3, but the captured group (i.e., the pattern) is only of length 2 - why?

end(0) gives the same position as start(0) and only of length 1 - why?

end(0) gives a value that is end of match plus one - why?

How can I modify the code so that it gives the correct start and end positions for all overlapping matches?

I tried to use re.finditer() to extract all overlapping matches of a pattern in a string, and then to extract the start and end positions of each match in the string. I expected to get the correct start and end positions for every overlapping match, but the positions I got look incorrect to me.

Your regex seems to be working exactly as intended. Finding overlapping matches requires that the overall match (group 0) be empty, or at least no more than a single character; anything actually matched by it is not eligible to be found by the next iteration. Your desired results are in group 1. — jasonharper, Mar 09 '23 at 15:18
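
To illustrate the comment above, here is a minimal sketch that reads the positions from group 1 instead of group 0. It reuses the string and pattern from the question; the names `pattern`, `zero_width_spans` and `overlapping` are only illustrative. It relies on the fact that `end()` is an exclusive index, like the upper bound of a Python slice.

```python
import re

target_string = 'GGGAA'
pattern = re.compile(r'(?=(GG))')

# Group 0 is the overall match. A lookahead consumes no characters,
# so group 0 is always an empty string with start == end.
zero_width_spans = [m.span(0) for m in pattern.finditer(target_string)]
print(zero_width_spans)  # [(0, 0), (1, 1)]

# Group 1 is the text captured inside the lookahead; its start/end
# are the positions of each overlapping match.
overlapping = [(m.group(1), m.start(1), m.end(1))
               for m in pattern.finditer(target_string)]
print(overlapping)  # [('GG', 0, 2), ('GG', 1, 3)]

# end(1) is exclusive, so slicing with the span recovers the match.
for text, start, end in overlapping:
    assert target_string[start:end] == text
```

Read this way, the span (0, 2) covers the two characters at indices 0 and 1, so the reported positions already describe a match of length 2. If an inclusive end index is needed, `end - 1` gives it.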
