Get the title tag from an HTML page using XPath?

Question

I have two pages that I'm trying to extract the title tag from using an XPath query. This page works: http://www.hobbyfarms.com/farm-directory/category-home-and-barn-resources-1.aspx

This page doesn't: http://cattletoday.com/links/Barns_and_Metal_Buildings/page-1.html?s=A

Here's my code:

$dom = new DOMDocument();
@$dom->loadHTMLFile($href);
$xpath = new DOMXPath($dom);

$titleNode = $xpath->query("//title");
foreach ($titleNode as $n) {
    $pageTitle = $n->nodeValue;
}

I've also tried this:

$xpath->query('//title')->item(0)->textContent

But it doesn't work for that one URL either.

Does anyone see why this is occurring, and hopefully have a solution?

score 4 · Accepted Answer · answered Mar 01 '13 at 18:48

The page is served gzip-compressed, so the raw response has to be decompressed before it is parsed. The following script works:

$href = 'http://cattletoday.com/links/Barns_and_Metal_Buildings/page-1.html?s=A';
$dom = new DOMDocument();
// Decompress the gzip-encoded body before handing it to the HTML parser.
$file = gzdecode(file_get_contents($href));
$dom->loadHTML($file);
$xpath = new DOMXPath($dom);
$titleNode = $xpath->query('//title');
// Dump the first <title> node that was found.
var_dump($titleNode->item(0));

(Note the call to `gzdecode`.)
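Since only one of the two pages is compressed, calling `gzdecode` unconditionally would fail on the other one. A minimal sketch of one way to handle both cases, assuming the body can be recognised by the gzip magic bytes (`0x1f 0x8b`); the helper names `decodeIfGzipped` and `extractTitle` are hypothetical, and the page is an in-memory string rather than a real download:

```php
<?php
// Hypothetical helper: decompress only when the body looks like gzip data.
function decodeIfGzipped(string $body): string
{
    // Every gzip stream starts with the two bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    // Not gzip (or not decodable): return the body unchanged.
    return $body;
}

// Hypothetical helper: return the text of the first <title>, or null.
function extractTitle(string $html): ?string
{
    // Collect parser warnings internally instead of printing them;
    // real-world markup is rarely well formed.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $node = $xpath->query('//title')->item(0);
    return $node === null ? null : trim($node->textContent);
}

// Illustrative input: the same page, once plain and once gzip-compressed.
$page = '<html><head><title>Barns</title></head><body></body></html>';
$plainTitle = extractTitle(decodeIfGzipped($page));
$gzipTitle  = extractTitle(decodeIfGzipped(gzencode($page)));
```

In a real script the argument would be the result of `file_get_contents($href)`. Using `libxml_use_internal_errors(true)` here is just an alternative to the `@` operator from the question.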

Asaf

+1 The `gzdecode` function is GOD GIVEN.... I spent days trying to solve this, but that function did the trick in a split second – ErickBest Sep 09 '21 at 10:44

score 2 · Answer 2 · MiMo · 2013-02-28T20:30:54.260

The second page uses the XHTML namespace, so you have to use XPath expressions qualified with that namespace:

$xpath->registerNamespace("xhtml", "http://www.w3.org/1999/xhtml");
$titleNode = $xpath->query("//xhtml:title|//title");
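To see what the namespace prefix changes, here is a minimal sketch that assumes the markup is well-formed XHTML and is parsed as XML with `loadXML` (not with `loadHTMLFile` as in the question); the `$xhtml` string is made up for illustration:

```php
<?php
// Illustrative, well-formed XHTML document with a default namespace.
$xhtml = '<?xml version="1.0"?>'
    . '<html xmlns="http://www.w3.org/1999/xhtml">'
    . '<head><title>Example</title></head><body/></html>';

$dom = new DOMDocument();
$dom->loadXML($xhtml);

$xpath = new DOMXPath($dom);
// Bind the prefix "xhtml" to the XHTML namespace URI.
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// An unprefixed name test only matches elements in no namespace.
$unqualified = $xpath->query('//title')->length;
// The prefixed name test matches the namespaced <title>.
$qualified = $xpath->query('//xhtml:title')->length;
// The union covers both namespaced and non-namespaced documents.
$title = $xpath->query('//xhtml:title|//title')->item(0)->textContent;
```

Under these assumptions `$unqualified` is 0 and `$qualified` is 1, which is why the union expression is the safer query when a document may or may not be namespaced.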

edited Feb 28 '13 at 20:30

answered Feb 28 '13 at 19:52

MiMo

I'm trying every variation of registering the namespace that I can find and I can't get it to work :( There are no errors; it just doesn't retrieve the element. That, or the element it gets doesn't have a nodeValue. – RachelD Mar 01 '13 at 18:31
@RachelD: the namespace declaration and XPath in my answer seem OK - I just tested them, but the page that does not work for you contains HTML that is NOT valid XML - I had to clean it up to use an XSLT processor to test the XPath. I don't know how PHP handles HTML that is not valid XML (I am not testing using PHP) - maybe that's the source of your problem. – MiMo Mar 01 '13 at 19:01
I think you're right, I'm just not sure how to get around it. I was able to get all the titles by pulling the page using curl. However, I'm also trying to scrape backlinks to my site and that's having the same issue :( – RachelD Mar 01 '13 at 19:32
