
String: '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'

I want to find only the first "td" tag that contains the text "str2", so I tried two different non-greedy expressions:

>>> import re
>>> mystring = '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'
>>> print(re.search("(<td.*?str2.*?</td>)", mystring).group(1))
<td attr="0">str1</td><td attr="5">str2</td>
>>> print(re.search(".*(<td.*?str2.*?</td>).*", mystring).group(1))
<td attr="7">str2</td>

I expected the output to be <td attr="5">str2</td>, because I used non-greedy quantifiers in the regular expression. What is wrong here, and how can I get the expected result?

Note: I cannot use an HTML parser because my actual data set is not well-formed enough for XML parsing.

dagpag

1 Answer


Use [^>] instead of . inside the opening tag, so the match cannot run past the end of that tag:

>>> print(re.search("(<td[^>]*?>str2.*?</td>)", mystring).group(1))
<td attr="5">str2</td>

(see demo)
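Why the original attempts fail: a non-greedy quantifier only controls how far a match extends, not where it starts. re.search still returns the leftmost position at which the pattern can match, and since . can cross tag boundaries, the match begins at the first <td and stretches forward until it reaches "str2". A minimal sketch of the difference, written for Python 3 and reusing the string from the question (the names loose and strict are only illustrative):

```python
import re

mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')

# "." can cross tag boundaries, so the match starts at the first "<td"
# and lazily stretches forward until it reaches "str2".
loose = re.search(r'<td.*?str2.*?</td>', mystring).group(0)

# "[^>]" cannot cross the ">" that closes the opening tag, so "str2"
# must come directly after the opening tag of the same cell.
strict = re.search(r'<td[^>]*>str2.*?</td>', mystring).group(0)

print(loose)   # <td attr="0">str1</td><td attr="5">str2</td>
print(strict)  # <td attr="5">str2</td>
```

The second attempt in the question fails for a related reason: the leading greedy .* swallows as much as possible, so the group ends up on the last cell from which "str2" can still be reached.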

Or, better, use HTMLParser.

EDIT: This regex also matches when the td contains sub-tags:

(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)
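A quick check of this pattern, as a sketch in Python 3. The sample string is hypothetical (it is not from the question): the second cell wraps its text in a <b> child tag. The (?!td) lookahead lets the match step over any tag that does not start with "td", so it cannot leak from one cell into the next.

```python
import re

# Hypothetical sample: the second cell wraps its text in a <b> child tag.
sample = ('<td attr="0">str1</td><td attr="5"><b>str2</b></td>'
          '<td attr="7">str2</td>')

# Pattern taken verbatim from the answer above.
pattern = re.compile(r'(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)')

first = pattern.search(sample).group(1)
print(first)  # <td attr="5"><b>str2</b></td>
```

Note that the pattern looks for "str2" anywhere inside the cell, so it would also match a cell whose attribute value contains "str2".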
logi-kal
  • Thanks for your reply. I did not use an HTML parser because my actual data set is not in proper XML format and the parser fails. Also, a td tag may contain multiple other child tags. The main point is that I want to fetch the first "td" tag that contains the text "str2". – dagpag May 24 '17 at 03:29
  • I understand. I think [this](https://regex101.com/r/VSPPFa/3) will work even with sub-tags. – logi-kal May 24 '17 at 09:02