
I am trying to use re.finditer() to get match objects for overlapping matches by combining a lookahead with a capturing group. Can someone explain how this actually works?

The following code:

import re

# Create a string with letters
target_string = 'GGGAA'

# Create a pattern using lookahead and capturing group
# to find all overlapping matches of 'GG' in target_string
str_pattern = r'(?=(GG))'

# Use re.finditer to extract all overlapping matches
result = re.finditer(str_pattern, target_string)

# For each match object in result print out
# groups() group(0)/group(1)
# span group(0)/group(1)
# start group(0)/group(1)
# end group(0)/group(1)
for i, match_obj in enumerate(result, 1):
    print('MO:', i)
    print('Groups:', match_obj.groups())
    print('Group 0:', match_obj.group(0))
    print('Group 1:', match_obj.group(1))
    print('Span group 0:', match_obj.span(0))
    print('Span group 1:', match_obj.span(1))
    print('Start Group 0:', match_obj.start(0))
    print('Start Group 1:', match_obj.start(1))
    print('End Group 0:', match_obj.end(0))
    print('End Group 1:', match_obj.end(1))
    print()

Gives the following results:

MO: 1
Groups: ('GG',)
Group 0:
Group 1: GG
Span group 0: (0, 0)
Span group 1: (0, 2)
Start Group 0: 0
Start Group 1: 0
End Group 0: 0
End Group 1: 2

MO: 2
Groups: ('GG',)
Group 0:
Group 1: GG
Span group 0: (1, 1)
Span group 1: (1, 3)
Start Group 0: 1
Start Group 1: 1
End Group 0: 1
End Group 1: 3

The code should find two matches, and it does: MO 1 and MO 2.

groups() gives a tuple with the matched substring and an empty (?) value - why?

group(0) gives an empty (?) value - why?

group(1) gives the captured group - why on position 1 and not 0?

span(0) gives a range of length 1 - why?

span(1) gives a range of length 3, but the captured group (i.e., the pattern) is only of length 2 - why?

end(0) gives the same position as start(0) and only of length 1 - why?

end(1) gives a value that is end of match plus one - why?

How to modify the code so that it gives the correct start and end position for all overlapping matches?

I tried to use re.finditer() to extract all overlapping matches of a pattern in a string, and then to read the start and end positions of each match. I expected correct start and end positions for every overlapping match, but the positions I got looked incorrect to me.

  • Your regex seems to be working exactly as intended. Finding overlapping matches requires that the overall match (group 0) be empty, or at least no more than a single character; anything actually matched by it is not eligible to be found by the next iteration. Your desired results are in group 1. – jasonharper Mar 09 '23 at 15:18
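To illustrate the comment above, here is a minimal sketch that reuses target_string from the question. The names plain and overlapping are only illustrative. It compares a plain 'GG' pattern with the lookahead version and reads the positions from group 1 rather than group 0:

```python
import re

target_string = 'GGGAA'

# Plain pattern: each match consumes 'GG', so the scan resumes after it
# and the second, overlapping 'GG' is never seen.
plain = [m.span() for m in re.finditer(r'GG', target_string)]

# Lookahead pattern: group 0 is zero-width (nothing is consumed),
# while group 1 records where the captured 'GG' sits in the string.
overlapping = [m.span(1) for m in re.finditer(r'(?=(GG))', target_string)]

print(plain)        # [(0, 2)]
print(overlapping)  # [(0, 2), (1, 3)]

# Spans are half-open (the end index is exclusive), so slicing the
# string with a span gives back exactly the captured text.
for start, end in overlapping:
    print(start, end, target_string[start:end])
```

Read this way, the empty group(0) and the (0, 0) and (1, 1) spans are expected, because a lookahead matches a position and not any text. The end values 2 and 3 follow Python's usual slice convention, where end is one past the last matched index.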
