I'm trying to use regular expressions in Python to extract the rows of an HTML table whose cells contain specific values. In this (contrived) example, my aim is to get the rows containing "cow".
import re
response = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''
r = re.compile(r'<tr.*?cow.*?tr>', re.DOTALL)
for m in r.finditer(response):
    print(m.group(0) + "\n")
My output is
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
While my aim is to get
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
I understand that the non-greedy ? doesn't help here: the engine starts matching at the first <tr, and a lazy quantifier only limits how far a match extends to the right, not where it begins. I fiddled around with negative lookbehinds and lookaheads but can't get them to work.
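To make that behaviour concrete, here is a small check (a sketch; rows_per_match is just an illustrative name) that counts how many rows each match of the pattern above spans. The first match begins at the first chicken row and runs through to the end of the first cow row:

```python
import re

response = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''

# Same pattern as in the question.
r = re.compile(r'<tr.*?cow.*?tr>', re.DOTALL)
matches = [m.group(0) for m in r.finditer(response)]

# Count the closing </tr> tags inside each match to see how many rows it spans.
rows_per_match = [m.count('</tr>') for m in matches]

print(len(matches))    # 3 matches, not 5
print(rows_per_match)  # [3, 1, 1]: the first match swallows both chicken rows
```

So the printed output looks like five separate rows, but the first three actually come from a single match.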
Does anybody have suggestions?
I'm aware of solutions like Beautiful Soup, etc. but the question is about understanding regular expressions, not the problem per se.
To address people's concerns about using regular expressions for HTML: the general problem I want to solve, using regular expressions ONLY, is to get from
response = '''0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff10randomstuffB4randomstuff10randomstuffB5randomstuff1'''
the output
0randomstuffB3randomstuff1
0randomstuffB4randomstuff1
0randomstuffB5randomstuff1
Here, each randomstuff stands for an arbitrary string that contains neither 0 nor 1.
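One direction that seems to fit this formulation, offered as a sketch rather than a definitive answer: stop the middle of the pattern from crossing a record boundary. With single-character delimiters a negated character class is enough. For a multi-character delimiter such as </tr>, a negative lookahead can guard each character instead. The variable names below are illustrative, and the HTML variant assumes rows are never nested and that "cow" cannot appear in an attribute.

```python
import re

general = ('0randomstuffA1randomstuff1'
           '0randomstuffA2randomstuff1'
           '0randomstuffB3randomstuff1'
           '0randomstuffB4randomstuff1'
           '0randomstuffB5randomstuff1')

# [^01]* can never run past a 0 or 1, so a match cannot span two records.
general_pattern = re.compile(r'0[^01]*B[^01]*1')
general_matches = general_pattern.findall(general)
for s in general_matches:
    print(s)

html = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''

# (?:(?!</tr>).)*? consumes one character at a time, but only while
# the text at that position is not the start of a closing </tr>.
html_pattern = re.compile(
    r'<tr(?:(?!</tr>).)*?cow(?:(?!</tr>).)*?</tr>', re.DOTALL)
html_matches = html_pattern.findall(html)
for s in html_matches:
    print(s)
```

In both cases a failed attempt at an early record forces the engine to retry from the next starting position, which is what the plain lazy quantifier could not do.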