I am trying to use re.finditer() to get match objects for overlapping matches by combining a lookahead with a capturing group. Can someone explain how this actually works?
The following code:
import re

# Create a string with letters
target_string = 'GGGAA'
# Create a pattern using a lookahead and a capturing group
# to find all overlapping matches in target_string
str_pattern = r'(?=(GG))'
# Use re.finditer to extract all overlapping matches
result = re.finditer(str_pattern, target_string)
# For each match object in result, print out
# groups() and group(0)/group(1)
# span of group(0)/group(1)
# start of group(0)/group(1)
# end of group(0)/group(1)
for i, match_obj in enumerate(result, 1):
    print('MO:', i)
    print('Groups:', match_obj.groups())
    print('Group 0:', match_obj.group(0))
    print('Group 1:', match_obj.group(1))
    print('Span group 0:', match_obj.span(0))
    print('Span group 1:', match_obj.span(1))
    print('Start Group 0:', match_obj.start(0))
    print('Start Group 1:', match_obj.start(1))
    print('End Group 0:', match_obj.end(0))
    print('End Group 1:', match_obj.end(1))
    print()
Gives the following results:
MO: 1
Groups: ('GG',)
Group 0:
Group 1: GG
Span group 0: (0, 0)
Span group 1: (0, 2)
Start Group 0: 0
Start Group 1: 0
End Group 0: 0
End Group 1: 2

MO: 2
Groups: ('GG',)
Group 0:
Group 1: GG
Span group 0: (1, 1)
Span group 1: (1, 3)
Start Group 0: 1
Start Group 1: 1
End Group 0: 1
End Group 1: 3
The code should generate two matches, and it does: MO 1 and MO 2. What I do not understand is the following:
groups() gives a tuple with the matched substring and an empty (?) value - why?
group(0) gives an empty (?) value - why?
group(1) gives the captured group - why on position 1 and not 0?
span(0) gives a range of length 1 - why?
span(1) gives a range of length 3, but the captured group (i.e., the pattern) is only of length 2 - why?
end(0) gives the same position as start(0) and only of length 1 - why?
end(1) gives a value that is the end of the match plus one - why?
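A small check that may help frame these questions. It is only a sketch and assumes the behavior described in the Python documentation: group 0 is the whole match, a lookahead consumes no characters, and span() returns half-open (start, end) offsets. The variable name m is illustrative.

```python
import re

target_string = 'GGGAA'
# First match of the lookahead pattern (m is an illustrative name)
m = re.search(r'(?=(GG))', target_string)

# Group 0 is the whole match; the lookahead is zero-width, so it is empty
print(repr(m.group(0)))          # ''
print(m.span(0))                 # (0, 0)

# Group 1 is the text captured inside the lookahead
start, end = m.span(1)
print(target_string[start:end])  # GG
print(end - start)               # 2

# groups() holds only the capturing groups; the trailing comma
# is how Python prints a one-element tuple
print(m.groups())                # ('GG',)
print(len(m.groups()))           # 1
```

If these assumptions hold, the end offset works like a slice bound: it points one past the last matched character, so end minus start gives the length.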
How can I modify the code so that it gives the correct start and end positions for all overlapping matches?
What I tried: using re.finditer() to extract all overlapping matches of a pattern in a string, and then reading the start and end positions of each match. What I expected: correct start and end positions for every overlapping match. What I got: start and end positions that look incorrect to me.
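One possible direction, sketched under the assumption that the positions wanted are those of the captured group (group 1) rather than of the zero-width match itself (group 0). The helper name overlapping_spans is hypothetical and not part of the re module.

```python
import re

def overlapping_spans(pattern, text):
    """Return half-open (start, end) spans of all overlapping matches."""
    # Wrap the pattern in a lookahead with one outer capturing group.
    # The outer group opens first, so it is always group 1.
    lookahead = re.compile('(?=(' + pattern + '))')
    return [m.span(1) for m in lookahead.finditer(text)]

spans = overlapping_spans('GG', 'GGGAA')
print(spans)                           # [(0, 2), (1, 3)]

# If an inclusive index of the last character is preferred, subtract one
print([(s, e - 1) for s, e in spans])  # [(0, 1), (1, 2)]
```

Reading the positions from span(1) instead of span(0) is the only change compared with the loop above; the pattern itself stays the same.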