php proDOM parsing error

Question

I am using the following code to parse a DOM document, but at the end I get the error: "google.ac" is null or not an object, line 402, char 1.

My guess is that line 402 contains a tag and a lot of ";" characters. How can I fix this?

<?php

//$ch = curl_init("http://images.google.com/images?q=books&tbm=isch/");


// create a new cURL resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);

// grab URL and pass it to the browser
$data = curl_exec($ch);

curl_close($ch); 

$dom = new DOMDocument();
       $dom->loadHTML($data);
    //@$dom->saveHTMLFile('newfolder/abc.html')

     $dom->loadHTML('$data');

    // find all ul

    $list = $dom->getElementsByTagName('ul'); 
    // get few  list items 

    $rows = $list->item(30)->getElementsByTagName('li'); 
    // get anchors from the table   

    $links = $list->item(30)->getElementsByTagName('a'); 

    foreach ($links as $link) { 
        echo "<fieldset>"; 
        $links = $link->getElementsByAttribute('imgurl');

    $dom->saveXML($links);
                }
?>

Then how can I do it? What should I do? My basic aim is to get the imgurl from the code, and possibly save it. — Zaffar Saffee, Jan 15 '12 at 14:51
Sorry, I got your point, chx: it's $data. While I was trying this, I used get_matche() to extract only the required tags, but when pasting the code here I forgot to change the variable. Updating now, thanks dear. — Zaffar Saffee, Jan 15 '12 at 17:55
Grabbing images from Google's Image Search is against their Terms and Conditions, and subject to breaking at any time when they decide to change their HTML structure. — Pekka, Jan 15 '12 at 17:57
As far as your point is concerned, dear Pekka: first of all, it is not for a commercial purpose; it is purely for educational and learning purposes. Secondly, what I intend to do is grab the image URL from Google's code, not the Google image. As far as I know, if we use a URL for an image, it points to its owner, so it is not against any law or terms and conditions. I will grab the URL and save it in a MySQL database, then create a page using the URLs from MySQL. However, thanks for sharing, dear. — Zaffar Saffee, Jan 15 '12 at 18:07

score 1 · Accepted Answer · edited May 23 '17 at 12:29

There are a few issues with the code:

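You should set the cURL option CURLOPT_RETURNTRANSFER in order to capture the output; by default, curl_exec() sends the output straight to the browser. Like this: curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);. Without it, as in the code above, $data is only ever TRUE or FALSE (http://www.php.net/manual/en/function.curl-exec.php).
$dom->loadHTML('$data'); is incorrect and unnecessary: single quotes prevent variable interpolation, so it parses the literal string $data and replaces the document loaded on the previous line.
The way the 'li' and 'a' tags are read is fragile, because $list->item(30) always points to the <ul> at index 30 (the 31st one, since indexes start at 0), and item() returns null when the page has fewer <ul> elements.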
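To illustrate the last point, here is a minimal sketch of how item() behaves with a hard-coded index. The sample markup is made up for the example and has nothing to do with Google's real page; it only shows that the index is zero-based and that an out-of-range index yields null, so it should be checked before chaining another call.

```php
<?php
// Hypothetical sample markup containing only two <ul> elements.
$html = '<html><body><ul><li>a</li></ul><ul><li>b</li></ul></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$lists = $dom->getElementsByTagName('ul');

$first   = $lists->item(0);  // index 0 is the first <ul>
$missing = $lists->item(30); // out of range: item() returns null

// Guard before chaining; calling a method on null is a fatal error.
$items = ($missing !== null) ? $missing->getElementsByTagName('li') : null;

echo 'Number of <ul> elements: ', $lists->length, "\n";
echo 'Text of the first <ul>: ', $first->textContent, "\n";
var_dump($missing);
```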

Anyway, on to the fixes. I'm not sure whether you checked the HTML returned by the cURL request, but it seems different from what we discussed in the original post. In other words, the HTML returned by cURL does not contain the required <ul> and <li> elements; it contains <td> and <a> elements instead.

Add-on: I'm not sure why the HTML for the same page differs between the browser and PHP, but here is a line of reasoning that I think might fit. The page uses JavaScript that renders some of the HTML dynamically on page load. This dynamic HTML is visible in the browser but not from PHP, which does not execute scripts. Hence, I assume the <ul> and <li> tags are generated dynamically. Anyway, that isn't our concern for now.
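The reasoning above can be checked on a small scale. The sketch below uses an invented page (not Google's actual markup) whose static HTML contains a <td> with a link, while the <ul> would only come into existence if a browser ran the script. DOMDocument parses the markup but never executes the script, so it finds the <a> and no <ul>.

```php
<?php
// Hypothetical page: the <ul> is created by JavaScript at load time,
// so it is not present in the HTML that PHP receives.
$html = <<<'HTML'
<html><body>
<table><tr><td><a href="/imgres?imgurl=http://example.com/a.jpg">a</a></td></tr></table>
<script>
var list = document.createElement('ul');
document.body.appendChild(list);
</script>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

// The script body is kept as plain text inside <script>; it is never run.
$ulCount = $dom->getElementsByTagName('ul')->length;
$aCount  = $dom->getElementsByTagName('a')->length;

echo "ul elements: $ulCount, a elements: $aCount\n";
```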

Therefore, you should modify your code to parse the <a> elements and then read the image URLs. This code snippet might help:

<?php
$ch = curl_init(); // create a new cURL resource

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

$data = curl_exec($ch); // fetch the URL; with CURLOPT_RETURNTRANSFER set, the response is returned as a string (FALSE on failure)
curl_close($ch); 

$dom = new DOMDocument();
@$dom->loadHTML($data); // avoid warnings

$listA = $dom->getElementsByTagName('a'); // read all <a> elements
foreach ($listA as $itemA) { // loop through each <a> element
    if ($itemA->hasAttribute('href')) { // check if it has an 'href' attribute
        $href = $itemA->getAttribute('href'); // read the value of 'href'
        if (preg_match('/^\/imgres\?/', $href)) { // check that 'href' begins with "/imgres?"
            $qryString = substr($href, strpos($href, '?') + 1);
            parse_str($qryString, $arrHref); // read the query parameters from 'href' URI
            echo '<br>' . $arrHref['imgurl'] . '<br>';
        }
    }
}

I hope the above makes sense. Please note that this parsing might fail if Google modifies their HTML.
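Since the markup can change, a slightly more defensive variant of the parsing step may be useful. This is only a sketch under a few assumptions: extractImgUrls() is a hypothetical helper name, the sample HTML is invented so the snippet runs without a network request, and PHP 7 or later is assumed for the type declarations. It swaps the @ operator for libxml_use_internal_errors() and skips links that have no 'imgurl' parameter instead of raising a notice.

```php
<?php
/**
 * Collect the 'imgurl' query parameter of every link whose href starts
 * with "/imgres?". Hypothetical helper, not part of the original answer.
 */
function extractImgUrls(string $html): array
{
    if ($html === '') {
        return []; // nothing to parse, e.g. after a failed request
    }

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns an empty string when 'href' is absent.
        $href = $anchor->getAttribute('href');
        if (strpos($href, '/imgres?') !== 0) {
            continue; // not an image result link
        }
        parse_str(substr($href, strlen('/imgres?')), $params);
        if (isset($params['imgurl']) && is_string($params['imgurl'])) {
            $urls[] = $params['imgurl'];
        }
    }
    return $urls;
}

// Invented sample input standing in for the fetched page.
$sample = '<table><tr>'
    . '<td><a href="/imgres?imgurl=http://example.com/a.jpg&amp;w=10">a</a></td>'
    . '<td><a href="/imgres?w=10">no imgurl</a></td>'
    . '<td><a href="/search?q=books">other</a></td>'
    . '</tr></table>';

$urls = extractImgUrls($sample);
foreach ($urls as $url) {
    echo $url, "\n";
}
```

In real use, the string returned by curl_exec() would be passed in after checking that it is not FALSE.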

Thanks again, Abhay, for helping me. You seem to be a nice guy who helps others, God bless you. I am trying your code and will confirm the result. Thanks again, bro. — Zaffar Saffee, Jan 16 '12 at 14:13
Voilà, it works, thanks to you, buddy. Thanks for the help; if I could vote for you, I would do it a lot for the solution, help, and guidance. Thanks again. — Zaffar Saffee, Jan 16 '12 at 14:17
