How to get the URLs from all 'a' tags on a web page (PHP)? My code does not work as expected

Question

I am trying to get the URLs from all 'a' tags on a web page.

For example:

$text = file_get_contents('http://stackoverflow.com/');
$preg = '/<a.+?href=(http:\/\/\w+?\..+?).*?>.+?<\/a>/';
preg_match_all($preg, $text, $result);
echo '<pre>';
print_r($result[1]);
echo '</pre>';

After that, I realized the href value may be wrapped in "..." or '...', so I changed the regex to

'/<a.+?href=[\'"](http:\/\/\w+?\..+?)[\'"].*?>.+?<\/a>/';

Then I found that the protocol may be http or https, so I changed the regex to

'/<a.+?href=[\'"](https?:\/\/\w+?\..+?)[\'"].*?>.+?<\/a>/';

But it still does not work as expected.

score 0 · Accepted Answer · edited Sep 07 '16 at 03:48


The answer is to stop trying to parse HTML with regular expressions and learn how to use a real parser, like the convenient DOM API in PHP (DOMDocument).

$html = <<<'HTML'
<a href="http://foobar.baz/firstlink">first link here</a>
<a href='https://www.foobar.quix/secondlink'>second link here</a>
<a href='//www.foobar.quix/thirdlink'>thirdlink here</a>
<a href=/fourthlink>fourthlink here</a>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName("a");

foreach($nodes as $node) {
    echo $dom->saveHTML($node), "\n";
}

Output

<a href="http://foobar.baz/firstlink">first link here</a>
<a href="https://www.foobar.quix/secondlink">second link here</a>
<a href="//www.foobar.quix/thirdlink">thirdlink here</a>
<a href="/fourthlink">fourthlink here</a>

Now it doesn't matter what's in the anchor tag or how it's written: whether it uses single quotes, double quotes, or no quotes at all, or whether the URL starts with http, https, or something else. You can always get the href attribute value with $node->getAttribute('href') from inside that loop.

foreach($nodes as $node) {
    echo $node->getAttribute("href"), "\n";
}

Output

http://foobar.baz/firstlink
https://www.foobar.quix/secondlink
//www.foobar.quix/thirdlink
/fourthlink
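
The loop above can be wrapped into a small reusable function. The following is only a sketch: the name extractHrefs is hypothetical, and the call to libxml_use_internal_errors() is an addition that assumes you want to suppress the warnings DOMDocument raises on real-world pages, which are rarely perfectly valid HTML.

```php
<?php
// Hypothetical helper: returns the href value of every <a> element in $html.
function extractHrefs(string $html): array
{
    // Collect parser warnings internally instead of printing them
    // (assumption: malformed markup should not abort or clutter output).
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $node) {
        // Skip anchors that have no href attribute at all.
        if ($node->hasAttribute('href')) {
            $hrefs[] = $node->getAttribute('href');
        }
    }
    return $hrefs;
}

// For a live page, the question's approach would be:
// $hrefs = extractHrefs(file_get_contents('http://stackoverflow.com/'));
$hrefs = extractHrefs(
    '<p><a href="http://foobar.baz/firstlink">one</a> '
    . '<a name="top">no href</a> '
    . '<a href=/fourthlink>four</a></p>'
);
print_r($hrefs);
```

The anchor without an href is skipped, so only the two href values are collected.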

edited Sep 07 '16 at 03:48

user3942918


answered Sep 07 '16 at 03:36

Sherif


`$html = <<<'HTML' HTML;` If the a tag has JavaScript or '#', is there a way to filter them? – hlfshy Sep 07 '16 at 04:11
The question answers itself. Read the `href` and `onclick` attributes of the element. – Sherif Sep 07 '16 at 04:37
Sorry, my English is not good. I mean that with your code, ` and ` are both output, but I only want the last one, just the URL. Maybe a regex or something else could distinguish the URL? The `href` values on a web page come in many forms, and choosing the URL drove me crazy; I spent a lot of time writing regexes. – hlfshy Sep 07 '16 at 07:58
How is regex going to make solving that problem any easier for you? It's not. You still need to figure out if `href` contains a valid URL or something else. The compounded problem you invented here is to try and do both (*parsing HTML AND validating the URL*) in one fell swoop. That's not easy. So break the problem down into two simpler steps. 1. Use DOM to parse the HTML and extract all the `<a>` tags. 2. Verify if the `<a>` tag's `href` attribute is a valid URL. – Sherif Sep 07 '16 at 08:01
Thanks very much, I got it. I think I was going the wrong way before, always trying to solve it in one step, which was too hard for me. Now I have finished it, and it outputs the correct URLs perfectly. – hlfshy Sep 07 '16 at 09:40
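
The two steps described in the comments above might look roughly like the sketch below. The helper name isHttpUrl is hypothetical, and the `javascript:` and `#` links in the sample markup are assumed examples of the cases asked about, not the asker's actual input. This version keeps only absolute http and https URLs; relative and protocol-relative links are dropped and would need to be resolved against the page's base URL if they are wanted.

```php
<?php
// Hypothetical helper: true only for absolute http:// or https:// URLs.
function isHttpUrl(string $href): bool
{
    $parts = parse_url(trim($href));
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        // Malformed, relative, protocol-relative, "#", or javascript: links.
        return false;
    }
    return in_array(strtolower($parts['scheme']), ['http', 'https'], true);
}

// Assumed sample input mixing real URLs with script and fragment links.
$html = <<<'HTML'
<a href="http://foobar.baz/firstlink">first</a>
<a href="javascript:void(0)">script</a>
<a href="#">hash</a>
<a href='https://www.foobar.quix/secondlink'>second</a>
HTML;

// Step 1: let DOM parse the HTML and collect every href.
$dom = new DOMDocument;
$dom->loadHTML($html);
$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $node) {
    $hrefs[] = $node->getAttribute('href');
}

// Step 2: validate each href separately, independent of the parsing.
$urls = array_values(array_filter($hrefs, 'isHttpUrl'));
print_r($urls);
```

Keeping the check in its own function means the rule for what counts as a usable URL can change without touching the parsing code.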
