Complex non-greedy matching with regular expressions

Question

I'm trying to use regular expressions in Python to parse rows from an HTML table whose cells contain specific values. My aim in this (contrived) example is to get the rows containing "cow".

import re

response = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''

r = re.compile(r'<tr.*?cow.*?tr>', re.DOTALL)

for m in r.finditer(response):
    print(m.group(0), "\n")

My output is

<tr class="someClass"><td></td><td>chicken</td></tr> <tr class="someClass"><td></td><td>chicken</td></tr> <tr class="someClass"><td></td><td>cow</td></tr>

<tr class="someClass"><td></td><td>cow</td></tr>

While my aim is to get

<tr class="someClass"><td></td><td>cow</td></tr>

I understand that the non-greedy ? doesn't work in this case because of how backtracking works. I fiddled around with negative lookbehinds and lookaheads but couldn't get it to work.

Does anybody have suggestions?

I'm aware of solutions like Beautiful Soup, etc. but the question is about understanding regular expressions, not the problem per se.

To address people's concerns about using regular expressions for HTML: the general problem I want to solve, using regular expressions ONLY, is to get from

response = '''0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff10randomstuffB4randomstuff10randomstuffB5randomstuff1'''

the output

0randomstuffB3randomstuff1 

0randomstuffB4randomstuff1 

0randomstuffB5randomstuff1

where randomstuff stands for arbitrary strings that contain neither 0 nor 1.

If your question is not about HTML, maybe you should not include HTML examples (they should not be parsed with regex) — Vasili Syrakis, Jun 08 '16 at 08:25

score 4 · Accepted Answer · answered Jun 08 '16 at 13:38

Your problem isn't related to greediness but to the fact that the regex engine tries to succeed at each position in the string, from left to right. That's why you will always obtain the leftmost result; using a non-greedy quantifier does not change the starting position!

If you write something like <tr.*?cow.*?tr> or 0.*?B.*?1 (for your second example), the pattern is first tried at the start of the string:

  <tr class="someClass"><td></td><td>chicken</td></tr>...
# ^-----here

# or

  0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3ra...
# ^-----here

The first .*? then eats characters until it reaches "cow" or "B". As a result, the first match is:

<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>

for your first example, and:

0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff1

for the second.
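
The leftmost-match behaviour described above can be checked with a minimal sketch, assuming Python 3 and the standard `re` module (the variable name `first` is illustrative). The lazy pattern still starts at index 0 and only stops at the first "1" that follows the first "B":

```python
import re

# The string from the question, split into its five 0...1 segments for readability.
response = ('0randomstuffA1randomstuff1'
            '0randomstuffA2randomstuff1'
            '0randomstuffB3randomstuff1'
            '0randomstuffB4randomstuff1'
            '0randomstuffB5randomstuff1')

first = re.search(r'0.*?B.*?1', response)

print(first.start())   # 0: the match begins at the leftmost possible position
print(first.group())   # spans the two "A" segments plus the first "B" segment
```

Laziness only affects how far each quantifier extends once a starting position has been chosen; it never moves the starting position to the right.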

To obtain what you want, you need to make the pattern fail at unwanted positions in the string. For that, .*? is useless because it is too permissive.

You can, for instance, forbid a </tr> or a 1 from occurring before "cow" or "B".

# easy to write but not very efficient (with DOTALL)
<tr\b(?:(?!</tr>).)*?cow.*?</tr>

# more efficient
<tr\b[^<c]*(?:<(?!/tr>)[^<c]*|c(?!ow)[^<c]*)*cow.*?</tr>

# easier to write when boundaries are single characters
0[^01B]*B[^01]*1
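
As a quick check, here is a sketch that runs the three patterns above against the data from the question, assuming Python 3. The names `tempered`, `unrolled` and `segments` are illustrative labels and do not come from the answer:

```python
import re

html = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''

data = ('0randomstuffA1randomstuff1'
        '0randomstuffA2randomstuff1'
        '0randomstuffB3randomstuff1'
        '0randomstuffB4randomstuff1'
        '0randomstuffB5randomstuff1')

# Each dot is guarded by a lookahead, so it can never step over a "</tr>".
tempered = re.compile(r'<tr\b(?:(?!</tr>).)*?cow.*?</tr>', re.DOTALL)

# Same idea, written with negated character classes instead of a per-character test.
unrolled = re.compile(
    r'<tr\b[^<c]*(?:<(?!/tr>)[^<c]*|c(?!ow)[^<c]*)*cow.*?</tr>', re.DOTALL)

# The patterns contain no capturing groups, so findall returns whole matches.
rows_tempered = tempered.findall(html)
rows_unrolled = unrolled.findall(html)
segments = re.findall(r'0[^01B]*B[^01]*1', data)

print(rows_tempered)   # the three "cow" rows only
print(rows_unrolled)   # same result
print(segments)        # the B3, B4 and B5 segments
```

In all three cases the attempts that start at a "chicken" row or an "A" segment fail outright, so the engine moves on to the next starting position instead of stretching the match.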

In the first regex, what's the use of `\b` and `.` after `tr>)` ? And can it be simplified to [this](https://regex101.com/r/lI1hD1/1) ? — Anmol Singh Jaggi, Jun 09 '16 at 09:08
@AnmolSinghJaggi: `\b` is a word boundary that ensures there are no more letters after `tr` (in case the document contains exotic tags). It's used as a kind of shortcut to say that a whitespace or a closing angle bracket follows `<tr`. `(?!</tr>).` matches any character that is not the start of `</tr>`. `(?!...)` is a negative lookahead and means *not followed by*. It's a *zero-width assertion*, which means that it's only a test and doesn't consume characters. — Casimir et Hippolyte, Jun 09 '16 at 12:02
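
The two constructs discussed in this comment can be tried in isolation. This is a small sketch assuming Python 3; the sample strings (including the `<track>` tag) and variable names are made up for illustration:

```python
import re

# \b requires a word boundary after "tr": a space or ">" qualifies, a letter does not.
matches_tr = bool(re.search(r'<tr\b', '<tr class="someClass">'))
matches_track = bool(re.search(r'<tr\b', '<track>'))
print(matches_tr, matches_track)     # True False

# (?!ow) is zero-width: it tests what follows "c" without consuming it.
in_chicken = re.match(r'c(?!ow)', 'chicken')
in_cow = re.match(r'c(?!ow)', 'cow')
print(in_chicken.end())              # 1: only the "c" was consumed
print(in_cow)                        # None: the "c" is followed by "ow"
```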

score 2 · Answer 2 · edited May 23 '17 at 11:52

If the input string contains each tag on a separate line, Moses Koledoye's answer would work.
However, if the tags are spread out over multiple lines, the following would be needed:

import re


response = '''
<tr class="someClass
"><td></td><td>chicken</td></tr><tr class="someClass"><td></td><td>chic
ken</td></tr><tr class="someClass"><td></td><td>cow</td></tr><tr class="someC
lass"><td></td><td>cow</td></tr><tr
class="someClass"><td></td><td>c
ow
</td></tr>
'''


# Remove all the newlines
# Required only if words like 'cow' and '<tr' are split between 2 lines
response = response.replace('\n', '')

r1 = re.compile(r'<tr.*?tr>', re.DOTALL)
r2 = re.compile(r'.*cow.*', re.DOTALL)

for m in r1.finditer(response):
    n = r2.match(m.group())
    if n:
        print(n.group(), '\n')

Note that this would work even if the tags were on separate lines as shown in the example string you provided, so this is a more general solution.

I think this is a good answer which uses regular expressions only. Just out of curiosity, I would be interested to know if somebody knows a one-line regular expression which solves this problem. — user2940666, Jun 08 '16 at 08:57

score 0 · Answer 3 · answered Jun 08 '16 at 08:27

If your 'response' string always contains newlines then you can do what you need without regex. Use the built-in split function to create a list of each line. Then iterate over the list and see if 'cow' is in the line:

response = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''

lines = response.split('\n')
cows = []
for line in lines:
    if 'cow' in line:
        cows.append(line)
print(cows)

output:

['<tr class="someClass"><td></td><td>cow</td></tr>', '<tr class="someClass"><td></td><td>cow</td></tr>', '<tr class="someClass"><td></td><td>cow</td></tr>']

score 0 · Answer 4 · answered Jun 08 '16 at 08:31

You don't really need regex for this at all.

As soon as you add ? after a quantifier in your expression, you've made that token lazy (non-greedy).

Anyway, you could just do:

for line in response.splitlines():
    if 'cow' in line:
        print(line)

no regex required.

If you want to know what a "non-greedy" match does, it does this:

import re

lazy = r'[a-z]*?b'
#             ^^ lazy
greedy = r'[a-z]*b'
#               ^ greedy

string = 'aaabbbaaabbb'

print(re.match(lazy, string))
print(re.match(greedy, string))

output

<_sre.SRE_Match object; span=(0, 4), match='aaab'>
<_sre.SRE_Match object; span=(0, 12), match='aaabbbaaabbb'>

Notice that the first match will match until the first 'b' it encounters. That's because it is trying to match as few times as possible (lazy).

The greedy match will match until the last 'b', because it tries to match as many times as possible.

Both quantifiers will 'give back as needed', that is to say, they backtrack: if the rest of the pattern cannot match, a lazy quantifier consumes more characters and a greedy one gives some back.
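
The backtracking behaviour described above can be sketched with the same test string, assuming Python 3. The two patterns here are illustrative variations on the ones above, chosen so that each quantifier is forced to adjust:

```python
import re

string = 'aaabbbaaabbb'

# Lazy, but the trailing $ forces it to keep expanding until the final 'b'.
lazy_forced = re.match(r'[a-z]*?b$', string)

# Greedy, but it must give characters back until an 'a' can follow.
greedy_backed_off = re.match(r'[a-z]*a', string)

print(lazy_forced.group())        # aaabbbaaabbb
print(greedy_backed_off.group())  # aaabbbaaa
```

Lazy and greedy only set the order in which the engine tries the possible lengths; the rest of the pattern still decides which length is finally accepted.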
