Python: non-greedy regular expression matches too much data

Question

String: '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'

I want to find only the first "td" tag that contains the text "str2", so I tried two different non-greedy expressions:

>>> import re
>>> mystring = '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'
>>> print(re.search("(<td.*?str2.*?</td>)", mystring).group(1))
<td attr="0">str1</td><td attr="5">str2</td>
>>> print(re.search(".*(<td.*?str2.*?</td>).*", mystring).group(1))
<td attr="7">str2</td>

I expected the output to be '<td attr="5">str2</td>' because I used non-greedy quantifiers in the regular expression. What is wrong here, and how can I get the expected result?

Note: I cannot use an HTML parser because my actual data set is not well-formed enough for XML parsing.

logi-kal · Accepted Answer · 2017-05-24T19:58:16.723

-1

Use [^>] instead of . so that the match cannot run past the end of the opening tag:

>>> print(re.search("(<td[^>]*?>str2.*?</td>)", mystring).group(1))
<td attr="5">str2</td>
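As for why the original patterns misbehave: a regex search reports the leftmost position at which a match can succeed, and a lazy quantifier only limits how far the match extends from that position. It does not move the starting point. In the second pattern from the question, the leading .* is greedy, so it swallows as much as it can and the group ends up on the last cell containing "str2". A minimal sketch of this, reusing mystring from the question (the names loose and strict are made up for illustration):

```python
import re

mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')

# "." may cross tag boundaries, so the match can begin at the very first
# "<td" and lazily stretch forward until it reaches "str2".
loose = re.search(r'<td.*?str2.*?</td>', mystring)

# "[^>]" cannot cross the ">" that closes the opening tag, so "str2"
# has to come directly after that tag.
strict = re.search(r'<td[^>]*>str2.*?</td>', mystring)

print(loose.start())    # index where the loose match begins
print(strict.start())   # index where the strict match begins
print(strict.group(0))
```

Comparing the two start() values shows that the loose pattern anchors on the first cell, while the strict one can only begin at a cell whose text starts with "str2".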


Or, better, use HTMLParser.
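A rough sketch of the HTMLParser route, using only the standard library. The class name FirstCellFinder and its attributes are hypothetical, and the code assumes td cells are not nested inside one another. Whether it copes with the asker's real, malformed data is untested:

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class FirstCellFinder(HTMLParser):
    """Hypothetical helper: records the first <td> whose text contains a needle."""

    def __init__(self, needle):
        HTMLParser.__init__(self)
        self.needle = needle
        self.in_td = False
        self.attrs = None
        self.text = []
        self.found = None   # (attrs, text) of the first matching cell

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            # Start collecting text for a new cell.
            self.in_td = True
            self.attrs = attrs
            self.text = []

    def handle_data(self, data):
        # Text inside child tags of the cell is collected as well.
        if self.in_td:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_td:
            self.in_td = False
            text = ''.join(self.text)
            if self.found is None and self.needle in text:
                self.found = (self.attrs, text)


mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')

finder = FirstCellFinder('str2')
finder.feed(mystring)
finder.close()
print(finder.found)
```

This yields the attributes and text of the cell rather than its raw markup, which is often what is needed anyway. HTMLParser is fairly tolerant of sloppy markup, so it may be worth trying even on input that a strict XML parser rejects.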

EDIT: This regex also matches cells in which "str2" is wrapped in sub-tags:

(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)
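The negative lookahead (?!td) is what keeps the match inside one cell: after "<td", the pattern may step over any other "<" (child tags, closing tags) but never over another opening "<td". A quick check of that behaviour, where the sample string is a made-up variation of the question's data with a <b> child tag:

```python
import re

pattern = re.compile(r'(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)')

# Hypothetical sample: the second cell wraps its text in a <b> child tag.
sample = ('<td attr="0">str1</td><td attr="5"><b>str2</b></td>'
          '<td attr="7">str2</td>')

match = pattern.search(sample)
print(match.group(1))
```

Note that the nested lazy quantifiers can backtrack heavily on long inputs that contain no match, so it may be worth timing this against real data.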

edited May 24 '17 at 19:58

answered May 22 '17 at 22:01


Thanks for your reply. I have not used an HTML parser because my actual data set is not in proper XML format and the parser fails. Also, a td tag may contain multiple child tags. But the main point is that I want to fetch the first "td" tag that contains the text "str2". – dagpag May 24 '17 at 03:29
I understand. I think [this](https://regex101.com/r/VSPPFa/3) will work even with sub-tags. – logi-kal May 24 '17 at 09:02
