Given the following (X)HTML structure, how do I capture the imgur link deep within the div tree?

I have tried several different methods. What I really want is to build a node tree for the div with id "siteTable", because it contains many divs that in turn contain more imgur links. In case you haven't noticed, this is the HTML for reddit.

Thanks!

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<body class="listing-page hot-page">
    <div id="header" role="banner">
    <div class="side">
    <a name="content"></a>
    <div class="content" role="main">
    <div class="infobar welcome">
    <div id="siteTable" class="sitetable linklisting">
        <div class=" thing id-t3_1gh823 over18 odd link " data-downs="5" data-ups="90" data-fullname="t3_1gh823" onclick="click_thing(this)">
            <p class="parent"></p>
            <span class="rank" style="width:2.20ex;">1</span>
            <div class="midcol unvoted" style="width:5ex;">
            <a class="thumbnail " href="http://i.imgur.com/FZ1I9wi.jpg">

This is what I know needs to be done:

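    $dom = new DOMDocument;
    $dom->preserveWhiteSpace = false; // set before loading the document

    // @ suppresses warnings from malformed markup while loading the page.
    @$dom->loadHTML(file_get_contents($link));

    $xpath = new DOMXPath($dom);
    $href = $xpath->query('?????');

    print_r($href);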
the tao

2 Answers

I always try to make my XPath expressions as basic, yet as specific, as possible. This makes them easier to change and debug as the page changes. It's hard to say without looking at the whole page or at multiple reddit pages, but I am assuming that the class thumbnail is only used for the thumbnail link you want. In that case, we can write a really simple (but specific) XPath query:

$link_nodes = $xpath->query('//a[@class="thumbnail"]');
if ($link_nodes->length > 0) {
  // Use a foreach loop here instead if there may be multiple links.
  $link_node = $link_nodes->item(0);
  $href = $link_node->attributes->getNamedItem('href')->value;
}

Also, you may want to make sure you are getting an imgur link by enhancing the XPath query:

$link_nodes = $xpath->query('//a[@class="thumbnail"][contains(@href, "imgur.com")]');
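
The question also asks about working inside the div with id "siteTable", and the markup there has a trailing space in the class attribute. The following is a minimal sketch of one way to handle both, not a definitive implementation. The sample markup is a made-up reduction of the question's HTML (the second and third links are hypothetical), and `$siteTable` and `$hrefs` are illustrative names. It matches "thumbnail" as a whole class token, so extra spaces do not matter, and it passes the siteTable div as the context node so that only links inside it are returned.

```php
<?php
// Sample markup modeled on the question; a real page would come from file_get_contents().
$html = <<<'HTML'
<html><body>
<div id="siteTable" class="sitetable linklisting">
  <div class=" thing odd link ">
    <a class="thumbnail " href="http://i.imgur.com/FZ1I9wi.jpg"></a>
  </div>
  <div class=" thing even link ">
    <a class="thumbnail " href="http://example.com/other.jpg"></a>
  </div>
</div>
<a class="thumbnail" href="http://i.imgur.com/outside.jpg"></a>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Context node: the div with id="siteTable" (null if the page has no such div).
$siteTable = $xpath->query('//div[@id="siteTable"]')->item(0);

// Match "thumbnail" as a whole class token, regardless of surrounding spaces,
// and keep only links whose href mentions imgur.com.
$query = './/a[contains(concat(" ", normalize-space(@class), " "), " thumbnail ")]'
       . '[contains(@href, "imgur.com")]';

$hrefs = [];
if ($siteTable !== null) {
    // The leading "." makes the query relative to $siteTable.
    foreach ($xpath->query($query, $siteTable) as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
}

print_r($hrefs);
```

With this sample, only the first link should be collected: the second is not an imgur URL, and the third sits outside the siteTable div.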
Sam
  • Sorry, I made an edit; that was a typo. You can technically use `//img` in XPath queries, since `<img>` is an HTML element. However, you should use `//a`, since we are looking for a link. – Sam Jun 17 '13 at 05:23
  • This does not return anything, and I am confused: you are not using any subqueries. You seem to be creating a query that goes directly to the line I want, and I didn't know this was possible. – the tao Jun 17 '13 at 05:27
  • This is good, but it will fail without the extra space in the class attribute. Also, try just: `$link_node->getAttribute('href')` – pguardiario Jun 17 '13 at 06:37
  • This simply is not returning anything, even though I have included the extra space. `$link_nodes` comes back empty. – the tao Jun 17 '13 at 07:06
  • Ah! The reason was that `file_get_contents` was returning JSON, not HTML. – the tao Jun 17 '13 at 07:38
  • @thetao, XPath can be really flexible and allows you to access nodes directly without going through the whole DOM structure. For future reference, look at http://www.freeformatter.com/xpath-tester.html; you'll find some useful examples and a simple form to test your XPath expressions. – Rolando Isidoro Jun 17 '13 at 10:08
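
As the comments above found, an empty XPath result can simply mean the response body was not HTML. A rough sketch of a check that could run before `loadHTML()` is shown here. `looksLikeJson` is a hypothetical helper, not part of PHP or of the thread, and both sample bodies are made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): report whether a response body is JSON.
function looksLikeJson(string $body): bool
{
    $trimmed = ltrim($body);
    // Only JSON objects and arrays are of interest here; anything else is treated as non-JSON.
    if ($trimmed === '' || ($trimmed[0] !== '{' && $trimmed[0] !== '[')) {
        return false;
    }
    json_decode($trimmed);
    return json_last_error() === JSON_ERROR_NONE;
}

// Made-up sample bodies; a real body would come from file_get_contents($link).
$htmlBody = '<html><body><a class="thumbnail " href="http://i.imgur.com/FZ1I9wi.jpg"></a></body></html>';
$jsonBody = '{"items": []}';

var_dump(looksLikeJson($htmlBody)); // bool(false)
var_dump(looksLikeJson($jsonBody)); // bool(true)
```

If the check returns true, the body would need `json_decode()` rather than DOMDocument and XPath.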

You can use the Simple HTML DOM parser. Download it and include it in your script, then parse the URL using the code below.

How to include the script:

if (!function_exists('file_get_html')) {
    require_once 'public/frontend/simple_html_dom.php';
}

How to parse:

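$scrape_url = 'http://www.example.com/a.php';
$html = file_get_html($scrape_url);

// Select the div by its id. The second argument (0) returns the first match
// instead of an array of matches.
$site_table = $html->find('div#siteTable', 0);
echo $site_table;

The library's site also has a full tutorial.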

Debashis