Regex search For HTML Tag with UUID

Question

I'm trying to match a single HTML tag whose id attribute is a UUID. I tested it with an external resource to make sure the regex is correct with the same input string. The UUID is extracted dynamically, so the string substitution is necessary.

The output I would expect is for the last line to print:

<tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef">

This is the code I tried:

import re

content = '<tbody><tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef"><td><b>01/08/2016 14:41:00</b></td>'
ref = 'b9060ff1-015d-4089-a193-8fef57e7c2ef'
regex = '<[^>]+?id=\"%s\"[^<]*?>' % ref
element_to_link = re.search(regex, content)
print(element_to_link.string)

The output I get when printing is the whole input string, which would suggest the regex is incorrect. What's going on here?

Please don't suggest that I use Beautiful Soup, this should be possible with regular expressions.

*"Please don't suggest that I use Beautiful Soup, this should be possible to do with regular expressions"* - but **why would you want to do that?** Parsing HTML with regex is [notoriously](http://stackoverflow.com/a/1732454/3001761) foolhardy, and there are plenty of tools out there for *actually parsing HTML*. — jonrsharpe, Aug 05 '16 at 11:38
Please don't roll back my edit; if you have a problem please comment explaining it. — jonrsharpe, Aug 05 '16 at 11:43
The fact that this is about parsing html is completely irrelevant, this could be about any string and the problem would persist. I would like to solve this with regex — Sebastian Smolorz, Aug 05 '16 at 11:45
*"Please read my question, I have used that exact website to test my regex"* - I have done, you never mentioned that. *"Why would you downvote the question?"* - I haven't, yet, despite your hostile attitude. *"if you don't have it please don't comment"* - that's not how this works, clarifying your requirements is exactly what comments are for. If you only want the match, why not just access `.group()`? Where did you get the idea you wanted `.string`? — jonrsharpe, Aug 05 '16 at 11:53
your `content` variable seems to be valid xml , why not try [xpath](http://lxml.de/xpathxslt.html)? — SomeDude, Aug 05 '16 at 11:53
_"I tested it with an external resource to make sure the regex is correct"_, I don't want to use xpath because I wish to take advantage of UUIDs — Sebastian Smolorz, Aug 05 '16 at 11:56
The `[^<]*?` in your regex runs past the end of the tag. `[^>]*` would work much better - (no need for the "?" lazy modifier here). Also, never use the words: "HTML" and "REGEX" in the same sentence here at StackOverflow (unless you wish to incur the wrath of Khan.) — ridgerunner, Aug 05 '16 at 14:51

score 0 · Accepted Answer · answered Aug 05 '16 at 11:57

Why not use the group() method? This works for me:

element_to_link.group(0)
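
To make the difference concrete, here is a minimal sketch that reuses the content, ref and regex from the question. The names matched_tag and whole_input are only illustrative and do not appear in the original post.

```python
import re

content = '<tbody><tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef"><td><b>01/08/2016 14:41:00</b></td>'
ref = 'b9060ff1-015d-4089-a193-8fef57e7c2ef'
regex = '<[^>]+?id="%s"[^<]*?>' % ref

element_to_link = re.search(regex, content)

# group(0) is the text the pattern actually matched
matched_tag = element_to_link.group(0)

# .string is the full string that was handed to re.search()
whole_input = element_to_link.string

print(matched_tag)
print(whole_input == content)
```

With the input from the question, the first print should show only the tr tag, while the second confirms that .string is just the original input.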

mateuszb

score 0 · Answer 2 · answered Aug 05 '16 at 12:31

According to the Python re module documentation, the MatchObject.string attribute is "The string passed to match() or search()", so it always holds the whole input rather than the matched part. To get the matched text, use one of the MatchObject methods such as group(), groups() or groupdict().
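
As a rough illustration of how these three methods differ, the sketch below uses a hypothetical variant of the pattern with two named groups, tag and uuid. These groups are an assumption added for the example; the regex in the question has no capture groups, so groups() and groupdict() would come back empty for it.

```python
import re

content = '<tbody><tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef"><td><b>01/08/2016 14:41:00</b></td>'

# Hypothetical pattern: captures the tag name and the 36-character id value
pattern = r'<(?P<tag>\w+)[^>]*?\bid="(?P<uuid>[0-9a-f-]{36})"[^>]*>'

m = re.search(pattern, content)

whole_match = m.group(0)   # the entire matched tag
captures = m.groups()      # tuple of all captured groups, in order
named = m.groupdict()      # dict mapping group names to captured text

print(whole_match)
print(captures)
print(named)
```

group() with no argument is the same as group(0); groups() and groupdict() are only useful when the pattern defines capture groups.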

FujiApple
