
I have two pages that I'm trying to extract the title tag from using an XPath query. This page works: http://www.hobbyfarms.com/farm-directory/category-home-and-barn-resources-1.aspx

This page doesn't: http://cattletoday.com/links/Barns_and_Metal_Buildings/page-1.html?s=A

Here's my code:

$dom = new DOMDocument();
@$dom->loadHTMLFile($href);
$xpath = new DOMXPath($dom);

$titleNode = $xpath->query("//title");
foreach ($titleNode as $n) {
    $pageTitle = $n->nodeValue;
}

I've also tried this:

$xpath->query('//title')->item(0)->textContent

But it doesn't work for that one URL either.

Does anyone see why this is happening, and hopefully have a solution?

RachelD

2 Answers


The file is gzipped, so it has to be decoded before parsing. The following script works:

$href = 'http://cattletoday.com/links/Barns_and_Metal_Buildings/page-1.html?s=A';
$dom = new DOMDocument();
$file = gzdecode(file_get_contents($href));
$dom->loadHTML($file);
$xpath = new DOMXPath($dom); 
$titleNode = $xpath->query('//title');
var_dump($titleNode->item(0));

(Note the use of the gzdecode function.)
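
A slightly more defensive variant, as a sketch only: it calls gzdecode only when the body starts with the gzip magic bytes (0x1f 0x8b), so the same helper could be used for both URLs in the question. The function names decodeIfGzipped and extractTitle are hypothetical, and the sample HTML is made up for illustration.

```php
<?php
// Hypothetical helper: return $body unchanged unless it looks gzip-compressed.
function decodeIfGzipped(string $body): string
{
    // Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body;
}

// Hypothetical helper: parse HTML and return the text of the first <title>, or null.
function extractTitle(string $html): ?string
{
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $node = $xpath->query('//title')->item(0);
    return $node === null ? null : trim($node->textContent);
}

// Made-up sample page, compressed here to stand in for a gzipped response body.
$sample = '<html><head><title>Barns and Metal Buildings</title></head><body></body></html>';
$gzipped = gzencode($sample);

echo extractTitle(decodeIfGzipped($gzipped)), "\n";
echo extractTitle(decodeIfGzipped($sample)), "\n";
```

With a live page, the input would be file_get_contents($href) in place of $gzipped.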

Asaf
  • +1 The `gzdecode` function is a godsend. I spent days trying to solve this, but that function did the trick in a split second – ErickBest Sep 09 '21 at 10:44

The second page uses the XHTML namespace, so you have to use XPath expressions qualified with that namespace:

$xpath->registerNamespace("xhtml", "http://www.w3.org/1999/xhtml");
$titleNode = $xpath->query("//xhtml:title|//title");
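
To illustrate when the prefix matters, here is a minimal sketch with a made-up XHTML string. It assumes the document is parsed as XML with loadXML, which keeps the elements in the XHTML namespace. A page parsed with loadHTML may behave differently, which is one reason to keep the union with //title.

```php
<?php
// Made-up, well-formed XHTML sample for illustration.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml">'
    . '<head><title>Sample XHTML page</title></head><body/></html>';

$dom = new DOMDocument();
$dom->loadXML($xhtml); // XML parsing keeps <title> in the XHTML namespace

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// An unprefixed name only matches elements in no namespace,
// so this query is not expected to find the namespaced title.
$plainCount = $xpath->query('//title')->length;

// The union matches the title whether or not it is namespaced.
$titleNode = $xpath->query('//xhtml:title|//title')->item(0);
$pageTitle = $titleNode === null ? null : $titleNode->textContent;

echo $plainCount, "\n";
echo $pageTitle, "\n";
```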
MiMo
  • I'm trying every variation of registerNamespace that I can find and I can't get it to work :( There are no errors; it just doesn't retrieve the element. Either that, or the element it gets doesn't have a nodeValue. – RachelD Mar 01 '13 at 18:31
  • @RachelD: the namespace declaration and XPath in my answer seem OK; I just tested them. However, the page that does not work for you contains HTML that is NOT valid XML, and I had to clean it up before I could test the XPath with an XSLT processor. I don't know how PHP handles HTML that is not valid XML (I am not testing with PHP), so maybe that's the source of your problem. – MiMo Mar 01 '13 at 19:01
  • I think you're right, I'm just not sure how to get around it. I was able to get all the titles by pulling the page using curl. However, I'm also trying to scrape backlinks to my site and that's having the same issue :( – RachelD Mar 01 '13 at 19:32