I am trying to get all the URLs from the 'a' tags of a web page.

For example:

$text = file_get_contents ( 'http://stackoverflow.com/' );
$preg = '/<a.+?href=(http:\/\/\w+?\..+?).*?>.+?<\/a>/';
preg_match_all($preg,$text,$result);
echo '<pre>';
   print_r($result['1']);
echo '</pre>';

After that, I realized the href value may be quoted as "..." or '...', so I changed the regex to:

'/<a.+?href=[\'"](http:\/\/\w+?\..+?)[\'"].*?>.+?<\/a>/';

Then I found that the protocol may be http or https, so I changed the regex to:

'/<a.+?href=[\'"](https?:\/\/\w+?\..+?)[\'"].*?>.+?<\/a>/';

But it still does not work as expected.

hlfshy

1 Answer

The answer is to stop trying to parse HTML with regular expressions and learn how to use a real parser, such as the convenient DOM API in PHP, which can load HTML through DOMDocument::loadHTML().

$html = <<<'HTML'
<a href="http://foobar.baz/firstlink">first link here</a>
<a href='https://www.foobar.quix/secondlink'>second link here</a>
<a href='//www.foobar.quix/thirdlink'>thirdlink here</a>
<a href=/fourthlink>fourthlink here</a>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName("a");

foreach($nodes as $node) {
    echo $dom->saveHTML($node), "\n";
}

Output

<a href="http://foobar.baz/firstlink">first link here</a>
<a href="https://www.foobar.quix/secondlink">second link here</a>
<a href="//www.foobar.quix/thirdlink">thirdlink here</a>
<a href="/fourthlink">fourthlink here</a>

Now it doesn't matter what's in the anchor tag or how it's written: whether the attribute uses single quotes, double quotes, or no quotes at all, and whether the URL starts with http, https, or anything else. Inside that loop you can always get the href attribute value with $node->getAttribute('href').

foreach($nodes as $node) {
    echo $node->getAttribute("href"), "\n";
}

Output

http://foobar.baz/firstlink
https://www.foobar.quix/secondlink
//www.foobar.quix/thirdlink
/fourthlink
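
If only absolute http and https links are wanted, as in the question, the values returned by getAttribute('href') can be filtered after parsing instead of inside a regex. The following is a minimal sketch, not part of the original answer: the function name extractHttpLinks is hypothetical, and it assumes that filtering on the URL scheme reported by parse_url() is an acceptable definition of "http or https link". It also silences libxml warnings, on the assumption that real pages often contain markup the parser complains about.

```php
<?php
// Hypothetical helper (illustration only): collect href values from <a>
// elements and keep only absolute http/https URLs.
function extractHttpLinks(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if ($html === '') {
        return [];
    }

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $node) {
        $href = $node->getAttribute('href');

        // parse_url() yields null when there is no scheme
        // (e.g. "//host/path" or "/path") and false on malformed input.
        $scheme = parse_url($href, PHP_URL_SCHEME);

        if (is_string($scheme) && in_array(strtolower($scheme), ['http', 'https'], true)) {
            $links[] = $href;
        }
    }

    return $links;
}
```

With the sample $html from above, extractHttpLinks($html) would keep the first two links and skip the protocol-relative and root-relative ones. For a live page, the result of file_get_contents() can be passed in the same way.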
user3942918
Sherif