I am trying to retrieve all the images from the page at http://www.homegate.ch/kaufen/105652197?3 using XPath in PHP. For some reason the XPath query for the body returns a result, but the query for the images does not. Here is my script:

<?php

$url = "http://www.homegate.ch/kaufen/105652197?3";

$body = '//body';
$img = '//img';

$html = file_get_contents($url);

# Call htmlentities as the $url content is not well-formatted: http://stackoverflow.com/questions/1685277/warning-domdocumentloadhtml-htmlparseentityref-expecting-in-entity
$html = htmlentities($html);

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DomXPath($dom);

$query = $xpath->query($body);

if($query->length == 1)
    echo $query->item(0)->nodeValue;

if($query->length < 1)
    echo "Xpath for body is no good!";

$query = $xpath->query($img);

if($query->length == 1)
    echo $query->item(0)->nodeValue;

if($query->length < 1)
    echo "Xpath for image is no good!";

Running this script returns:

1. <!DOCTYPE html>..
2. Xpath for image is no good!

What is going wrong here? Why does the XPath query work for body but not for img?

user1965074

1 Answer

You have to remove this line:

$html = htmlentities( $html );

To suppress the DOM warnings about malformed HTML, call libxml_use_internal_errors() before loading the document instead:

$dom = new DOMDocument();
libxml_use_internal_errors( true );         # collect parse warnings internally instead of emitting them
$dom->loadHTML( $html );
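
Putting both changes together, a minimal sketch might look like the following. It assumes a small inline HTML string (hypothetical, standing in for the result of file_get_contents($url)) so that it runs without network access. It also reads the src attribute of each image, on the assumption that the image URLs are what is wanted: an img element has no text content, so its nodeValue would be empty.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
// The bare "&" is the kind of input that normally triggers htmlParseEntityRef warnings.
$html = '<!DOCTYPE html><html><body><p>Tom & Jerry</p>'
      . '<img src="/a.jpg" alt="A"><img src="/b.jpg" alt="B"></body></html>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);              // note: no htmlentities() beforehand

// Discard the collected warnings and restore the previous error mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// Gather the src attribute of every img element in the document.
$sources = [];
foreach ($xpath->query('//img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

Looping over the node list, rather than testing for length == 1 as in the question, handles pages with any number of images.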

With your code, the //body XPath query only appears to work: htmlentities() escapes every tag, so the parser receives plain text and wraps it in a generated body, whose content is:

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#" class="no-js unknown unknown" lang="de">
<head><script type="text/javascript" src="/ver-20160426133955/assets/js/jquery.js"></script>
(...)

That, clearly, is not the body! It is the page's escaped source treated as text, so there are no img elements for //img to match.
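
The effect of htmlentities() can be reproduced on a small sample. The sketch below uses a hypothetical one-line HTML string instead of the real page and deliberately repeats the problematic call, then checks what the two XPath queries from the question find.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<html><body><img src="/a.jpg"></body></html>';

libxml_use_internal_errors(true);

$dom = new DOMDocument();
// The problematic step, reproduced on purpose: every "<" becomes "&lt;",
// so the parser sees text rather than tags.
$dom->loadHTML(htmlentities($html));

$xpath = new DOMXPath($dom);

// No img element exists in the parsed document.
$imgCount = $xpath->query('//img')->length;

// A body element is generated by the parser; its text is the original markup.
$bodyText = $xpath->query('//body')->item(0)->nodeValue;

var_dump($imgCount);
var_dump(strpos($bodyText, '<img') !== false);
```

The first value shows that //img matches nothing, and the second shows that the img tag survives only as literal text inside the generated body.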

fusion3k