The answer is to stop trying to parse HTML with regular expressions and use a real parser instead, such as PHP's DOM extension, whose DOMDocument class can load HTML directly.
$html = <<<'HTML'
<a href="http://foobar.baz/firstlink">first link here</a>
<a href='https://www.foobar.quix/secondlink'>second link here</a>
<a href='//www.foobar.quix/thirdlink'>thirdlink here</a>
<a href=/fourthlink>fourthlink here</a>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName("a");
foreach ($nodes as $node) {
    echo $dom->saveHTML($node), "\n";
}
Output
<a href="http://foobar.baz/firstlink">first link here</a>
<a href="https://www.foobar.quix/secondlink">second link here</a>
<a href="//www.foobar.quix/thirdlink">thirdlink here</a>
<a href="/fourthlink">fourthlink here</a>
Now it doesn't matter what's in the anchor tag or how it's written: single quotes, double quotes, or no quotes at all, and whether the URL starts with http, https, or anything else. You can always get the href attribute value with $node->getAttribute('href') from inside that loop.
foreach ($nodes as $node) {
    echo $node->getAttribute("href"), "\n";
}
Output
http://foobar.baz/firstlink
https://www.foobar.quix/secondlink
//www.foobar.quix/thirdlink
/fourthlink
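If you need this in more than one place, the loop above can be wrapped in a small function. The following is only a sketch: the name extractHrefs is made up for illustration, and the libxml_use_internal_errors() calls are an assumption that your real input may contain invalid markup, which would otherwise make loadHTML() emit warnings.

```php
<?php
// Hypothetical helper (not part of the DOM API): collect every href in an HTML string.
function extractHrefs(string $html): array
{
    // Assumption: input may be malformed, so buffer libxml warnings
    // instead of letting loadHTML() emit them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the buffered warnings and restore the earlier setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $node) {
        // Skip anchors without an href attribute, e.g. <a name="top">.
        if ($node->hasAttribute('href')) {
            $hrefs[] = $node->getAttribute('href');
        }
    }

    return $hrefs;
}

$html = <<<'HTML'
<a href="http://foobar.baz/firstlink">first link here</a>
<a name="top">no href here</a>
<a href=/fourthlink>fourthlink here</a>
HTML;

print_r(extractHrefs($html));
```

The hasAttribute() check matters because getAttribute() returns an empty string for a missing attribute, so without it anchors that have no href would show up as blank entries.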