I am working on a web scraping application using simple_html_dom. I need to extract all the images in a web page. There are three places an image can come from:

  1. Images in <img> tags.
  2. Images referenced from CSS inside a <style> tag on the same page.
  3. Images referenced from an inline style on a <div> or some other tag.

I can scrape all the images by using the following code.

function download_images($html, $page_url , $local_url){

    foreach($html->find('img') as $element) {
        $img_url = $element->src;
        $img_url = rel2abs($img_url, $page_url);
        $parts   = parse_url($img_url);
        $img_path=  $parts['path'];
        $url_to_be_change = $GLOBALS['website_server_root'].$img_path;
        download_file($img_url, $GLOBALS['website_local_root'].$img_path);  
        $element->src=$url_to_be_change;            
    }

    $css_inline = $html->find("style");

    $matches = array();
    preg_match_all( "/url\((.*?)\)/", $css_inline, $matches, PREG_SET_ORDER );
    foreach ( $matches as $match )    {
        $img_url = trim( $match[1], "\"'" );
        $img_url = rel2abs($img_url, $page_url);
        $parts   = parse_url($img_url);
        $img_path=  $parts['path'];
        $url_to_be_change = $GLOBALS['website_server_root'].$img_path  ;
        download_file($img_url , $GLOBALS['website_local_root'].$img_path); 
        $html = str_replace($img_url , $url_to_be_change , $html );
    }

    return $html;
}

$html = download_images($html , $page_url , $dir); // working fine
$html = str_get_html ($html);
$html->save($dir. "/" . $ff);    

Please note that I am also modifying the HTML after downloading the images.

Downloading works fine, but when I try to save the HTML, I get the following error:

PHP Fatal error: Cannot use object of type simple_html_dom as array

Important: it works perfectly fine if I leave out the str_replace call and the second loop.

Fatal error: Cannot use object of type simple_html_dom as array in /var/www/html/app/framework/cache/includes/simple_html_dom.php on line 1167

user2674341
  • The $html as the last argument in your str_replace call is an object, not an array. str_replace apparently doesn't like that. You need to figure out another way to represent that data as an array, or re-work it somehow. – Tech Savant Apr 30 '15 at 12:39
  • obligatory http://stackoverflow.com/a/1732454/3044080 – nomistic Apr 30 '15 at 13:56

3 Answers

Guess №1

I see a possible mistake here:

$html = str_get_html($html);

It looks like you are passing an object to str_get_html(), which accepts a string as its argument. Let's fix that this way:

$html = str_get_html($html->plaintext);

We can only guess at the contents of the $html variable when it reaches this piece of code.
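To illustrate the type issue described above, here is a minimal sketch. It assumes the document object can be cast to a string, as simple_html_dom objects can. FakeDom and to_markup are hypothetical stand-ins and are not part of the library. A cast (or str_replace, which casts implicitly) yields the full markup as a string, whereas plaintext would return the text with the tags stripped.

```php
<?php
// Hypothetical stand-in for a simple_html_dom document: it only mimics the
// fact that the real object can be converted to a string.
class FakeDom
{
    private $markup;

    public function __construct($markup)
    {
        $this->markup = $markup;
    }

    public function __toString()
    {
        return $this->markup;
    }
}

// Hypothetical helper: always hand back the markup as a plain string,
// whether it arrives as an object or already as a string.
function to_markup($html)
{
    return is_object($html) ? (string) $html : $html;
}

$dom = new FakeDom('<div style="background:url(a.png)"></div>');

// str_replace() casts the object to a string, so the result is a string
// and no longer an object.
$replaced = str_replace('a.png', '/local/a.png', $dom);

var_dump(is_string($replaced));       // bool(true)
var_dump(is_string(to_markup($dom))); // bool(true)
```

With the real library, the equivalent would presumably be str_get_html((string) $html), so that the function receives a string whether or not any replacement happened. That variant is an assumption and is not tested here.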

Guess №2

Or maybe we just need to use another variable in the download_images function so that your code behaves correctly in both cases:

function download_images($html, $page_url , $local_url){

    foreach($html->find('img') as $element) {
        $img_url = $element->src;
        $img_url = rel2abs($img_url, $page_url);
        $parts   = parse_url($img_url);
        $img_path=  $parts['path'];
        $url_to_be_change = $GLOBALS['website_server_root'].$img_path  ;
        download_file($img_url , $GLOBALS['website_local_root'].$img_path); 
        $element->src=$url_to_be_change;            
    }

    $css_inline = $html->find("style");

    $result_html = "";
    $matches = array();
    preg_match_all( "/url\((.*?)\)/", $css_inline, $matches, PREG_SET_ORDER );
    foreach ( $matches as $match )    {
        $img_url = trim( $match[1], "\"'" );
        $img_url = rel2abs($img_url, $page_url);
        $parts   = parse_url($img_url);
        $img_path=  $parts['path'];
        $url_to_be_change = $GLOBALS['website_server_root'].$img_path  ;
        download_file($img_url , $GLOBALS['website_local_root'].$img_path); 
        $result_html = str_replace($img_url , $url_to_be_change , $html );
    }

    return $result_html;
}

$html = download_images($html , $page_url , $dir); // working fine
$html = str_get_html ($html);
$html->save($dir. "/" . $ff);

Explanation: if there are no matches (the $matches array is empty), we never enter the second loop, which is why the $html variable still has the same value as at the beginning of the function. This is a common mistake when you reuse one variable in a place where the code needs two different variables.
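The two-variable idea can be sketched on a plain string as follows. This is a minimal sketch and not the asker's full function: rewrite_css_urls is a hypothetical helper, the rel2abs and download_file steps are left out, and only absolute URLs are rewritten. The result variable starts from the full markup and every replacement is applied to it, so the function returns a string whether or not anything matched.

```php
<?php
// Hypothetical helper: rewrite every absolute url(...) in $markup so that it
// points at $local_root, keeping only the path component of each URL.
function rewrite_css_urls($markup, $local_root)
{
    // Start from the full markup so a usable string is returned even when
    // nothing matches.
    $result = (string) $markup;

    preg_match_all('/url\((.*?)\)/', $result, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $img_url = trim($match[1], "\"' ");
        $parts   = parse_url($img_url);

        // Assumption: skip anything that is not an absolute URL with a path
        // (relative URLs, data: URIs, malformed values).
        if (empty($parts['host']) || empty($parts['path'])) {
            continue;
        }

        // Apply each replacement to the accumulated result, not the original.
        $result = str_replace($img_url, $local_root . $parts['path'], $result);
    }

    return $result;
}

$page = '<style>a{background:url("http://example.com/img/a.png")}'
      . 'b{background:url(http://example.com/img/b.png)}</style>';

echo rewrite_css_urls($page, '/local'), "\n";
echo rewrite_css_urls('<p>no images</p>', '/local'), "\n";
```

Because the return value is always a string, passing it to str_get_html afterwards should be safe in both the matching and the non-matching case.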

Andrew Surzhynskyi
  • line 1167 : if ($this->size>0) $this->char = $this->doc[0]; – user2674341 Apr 30 '15 at 12:47
  • Updated my answer. Added one more solution (see Guess №2 part). Please tell me which one of those two works in all the cases. – Andrew Surzhynskyi Apr 30 '15 at 13:05
  • Now it's showing this error. I can't see your second solution. PHP Fatal error: Call to a member function save() on a non-object in – user2674341 Apr 30 '15 at 13:12
  • Ah, that is okay, look at the last two lines: `$html = str_get_html ($html);` here we save a string to $html variable, and the last one `$html->save($dir. "/" . $ff);` we are still trying to use it as an object, but it is string now! You should fix it to make your program work as intended, I can't help you, because I only know a small part of the code, not all the program. Hope this explanation will help you fix it. – Andrew Surzhynskyi Apr 30 '15 at 13:20
  • I have tried the second solution. The old error is gone, but I can't save the HTML. Here is the error: Fatal error: Call to a member function save() on a non-object in – user2674341 Apr 30 '15 at 13:26
  • I explained that in my previous comment, please read it once again. – Andrew Surzhynskyi Apr 30 '15 at 13:27

As the error message states, you are dealing with an object where you should have an array. You could try typecasting your object:

$array =  (array) $yourObject;

That should solve it.

Burki

I had this error and, in my case, solved it by using return $html->save(); at the end of the function. I can't explain why two instances with different variable names, scoped in different functions, caused this error. I guess this is how the "simple html dom" class works.

So, just to be clear: try calling $html->save() first, before you do anything else afterwards.

I hope this information helps somebody :)

larsmqller