regular expression anchor tag

Question

I am using PHP and I am having trouble parsing the href from an anchor tag together with its text.

Example: an anchor tag containing "test" and http://www.test.com,

like this: <a href="http://www.test.com" title="test">http://www.test.com</a>

I want to match all of the text in the anchor tag.

Thanks in advance.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454: don't parse HTML with regex. — Katriel, Jul 29 '10 at 09:51
Two questions. 1st: do you want to match test or http://www.test.com? 2nd: do you want to match it ` or here? `? — Ties, Jul 29 '10 at 09:53

Daniel Egeberg · Answer 1 · 2010-07-29T10:20:55.503

score 6

Use DOM:

$text = '<a href="http://www.test.com" title="test">http://www.test.com</a> something else hello world';

// Parse the fragment into a DOM tree.
$dom = new DOMDocument();
$dom->loadHTML($text);

// Print the text content of every <a> element found in the document.
foreach ($dom->getElementsByTagName('a') as $a) {
    echo $a->textContent;
}

DOM is specifically designed to parse XML and HTML. It will be more robust than any regex solution you can come up with.
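Since the question asks for the href as well as the link text, the same DOM approach can return both. The following is a minimal sketch, not part of the original answer: `extractLinks()` is a hypothetical helper name, and the `libxml_use_internal_errors()` calls are an assumption added to keep parser warnings about imperfect markup out of the output.

```php
<?php
// Hypothetical helper (not from the original answer): collects the href
// attribute and the text content of every <a> element in an HTML fragment.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => $a->textContent,
        ];
    }
    return $links;
}

$text = '<a href="http://www.test.com" title="test">http://www.test.com</a> something else hello world';
foreach (extractLinks($text) as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

Returning an array rather than echoing inside the loop keeps the parsing step separate from whatever is done with the links afterwards.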

edited Jul 29 '10 at 10:20

answered Jul 29 '10 at 10:10


Not that there's anything "wrong" with how you did it, why didn't you just use `DomElement::getElementsByTagName()` instead of the XPath query? It should be more efficient for that simple path... – ircmaxell Jul 29 '10 at 10:18

score -1 · Answer 2 · answered Jul 29 '10 at 10:09

Assuming you wish to select the link text of an anchor link with that href, then something like this should work...

$input = '<a href="http://www.test.com" title="test">http://www.test.com</a>';
$pattern = '#<a href="http://www\.test\.com"[^>]*>(.*?)</a>#';

if (preg_match($pattern, $input, $out)) {
    echo $out[1];
}

This is technically not perfect (a literal > can appear inside an attribute value, which would end the [^>]* match too early), but it will work for typical markup. As several of the comments have mentioned though, you should be using a DOM.
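To make the caveat concrete, here is a small sketch of the failure case. The input with a > inside the title attribute is an invented example, and the DOM comparison is added for illustration only; it is not part of the original answer.

```php
<?php
// Invented input: the title attribute value contains a literal ">".
$input = '<a href="http://www.test.com" title="a > b">text</a>';
$pattern = '#<a href="http://www\.test\.com"[^>]*>(.*?)</a>#';

// [^>]* stops at the ">" inside the attribute value, so the capture
// group picks up the rest of the attribute along with the link text.
preg_match($pattern, $input, $out);
$regexText = $out[1];

// The DOM parser understands quoted attribute values and returns only the text.
$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($input);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$domText = $dom->getElementsByTagName('a')->item(0)->textContent;

var_dump($regexText, $domText);
```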

score -1 · Answer 3 · answered Jul 29 '10 at 10:09

If you have already obtained the anchor tag you can extract the href attribute via a regex easily enough:

<a [^>]*href="([^"]*)"[^>]*>
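A minimal sketch of applying this in PHP, assuming the capture group is meant to take the whole attribute value (that is, `([^"]*)`, zero or more non-quote characters) and that the href is always double-quoted. `hrefFromAnchorTag()` is a hypothetical helper name, not an existing function.

```php
<?php
// Hypothetical helper: returns the href of a single, already isolated
// opening <a ...> tag, or null when no double-quoted href is found.
function hrefFromAnchorTag(string $tag): ?string
{
    // ([^"]*) captures everything up to the closing quote of the attribute.
    $pattern = '#<a [^>]*href="([^"]*)"[^>]*>#';

    if (preg_match($pattern, $tag, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

var_dump(hrefFromAnchorTag('<a href="http://www.test.com" title="test">'));
var_dump(hrefFromAnchorTag('<a title="test">'));
```

Single-quoted or unquoted href values are not handled by this pattern; the DOM approach covers those without extra work.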

If you instead want to extract the contents of the tag and you know what you are doing, it isn't too hard to write a simple recursive descent parser, using cascading regexes, that will parse all but the most pathological cases. Unfortunately PHP isn't a good language in which to learn this technique, so I wouldn't recommend using this project to learn it.

So if it is the contents you are after, not the attribute, then @katrielalex is right: don't parse HTML with regex. You will run into a world of hurt with nested formatting tags and other legal HTML that isn't compatible with regular expressions.
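As an illustration of the nested-tag problem, the sketch below runs a simple regex and the DOM over the same anchor. The input markup and the pattern are invented for this example and are not taken from the question.

```php
<?php
// Invented input: link text wrapped in nested formatting tags.
$html = '<a href="http://www.test.com"><b>bold</b> and <i>italic</i> text</a>';

// A simple regex returns the raw inner markup, nested tags included.
preg_match('#<a [^>]*>(.*?)</a>#s', $html, $m);
$regexResult = $m[1];

// textContent concatenates the text of all descendant nodes,
// so the nested <b> and <i> tags disappear from the result.
$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$domResult = $dom->getElementsByTagName('a')->item(0)->textContent;

var_dump($regexResult, $domResult);
```

Stripping the leftover tags from the regex result would need yet another pattern, which is where the regex approach starts to pile up special cases.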
