3

I'm new to web programming. I need to retrieve some information from a web page. I have the URL of the page, so I want to get its HTML source, convert it to XML, and then use PHP's DOM functions to fetch the information I need.

My PHP code is this:

$url=$_POST['url']; //url

$doc_html=new DOMDocument();
$doc_html->loadHTML($url); //html page
$doc_xml=new DOMDocument();
$doc_xml->loadXML($doc_html->saveXML()); //xml converted page

$nome_app=new DOMElement($doc_xml->getElementById('title'));

echo $nome_app->nodeValue;

I get this fatal error:

Uncaught exception 'DOMException' with message 'Invalid Character Error' on this line:

$nome_app=new DOMElement($doc_xml->getElementById('title'));

What's wrong? Is the whole HTML-to-XML process the problem? I found some examples on the web, and it looks like this should work. Thanks!

air4x
esseara

4 Answers

2

Solved! Simply:

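$doc_html = new DOMDocument();
// loadHTML() expects markup, not a URL, so fetch the page contents first
$doc_html->loadHTML(file_get_contents($url));
// the return value is discarded here, so this line has no effect and can be dropped
$doc_html->saveXML();
// query the HTML document directly; no separate XML document is needed
$nome = $doc_html->getElementsByTagName('h1');
foreach ($nome as $n) {
    echo $n->nodeValue, PHP_EOL;
}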

Maybe the code was too messy before. Thanks everybody for the answers!
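
For reference, a minimal sketch of the likely cause of the original error, using an inline HTML string (an assumption, standing in for the fetched page) so it runs without network access. loadHTML() parses markup rather than a URL, and getElementById() already returns the element (or null when nothing matches). The first argument of the DOMElement constructor is a tag name, so wrapping that result in new DOMElement(...) is most likely what raised the Invalid Character Error.

```php
<?php
// Hypothetical markup standing in for file_get_contents($url).
$html = '<html><body><h1 id="title">My App</h1><p>Other text</p></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html); // parses an HTML string, not a URL

// getElementById() returns the element itself, or null if there is no match,
// so there is no need to construct a new DOMElement around it.
$title = $doc->getElementById('title');
$name = ($title !== null) ? $title->nodeValue : '';

echo $name, PHP_EOL;
```

The null check matters on real pages, where the id may simply not exist.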

esseara
1

You need to define XML entities for the special characters that you're using in your HTML. It is probably the same kind of problem as here: DOMDocument::loadXML vs. HTML Entities
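
A small sketch of the entity issue, assuming a made-up fragment that contains &nbsp;. XML predefines only five entities (&amp;, &lt;, &gt;, &quot;, &apos;), so loadXML() is expected to reject the fragment, while loadHTML() knows the HTML entity set and resolves it.

```php
<?php
// Collect parser errors internally instead of printing warnings.
libxml_use_internal_errors(true);

// Hypothetical fragment using an HTML-only entity.
$fragment = '<p>a&nbsp;b</p>';

// Parsed as XML: &nbsp; is not declared, so the document is not well-formed.
$xml = new DOMDocument();
$xml_ok = $xml->loadXML($fragment);

// Parsed as HTML: the entity is resolved to a non-breaking space.
$html = new DOMDocument();
$html_ok = $html->loadHTML($fragment);
$text = $html->getElementsByTagName('p')->item(0)->nodeValue;

var_dump($xml_ok, $html_ok);
libxml_clear_errors();
```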

Community
Bgi
1

I would use preg_match() to get the content you need rather than parsing the whole document as XML. In particular, if the document becomes invalid for some reason, the XML approach will no longer return your information.
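
A rough sketch of that idea, assuming the target is the first h1 element and using a made-up HTML string. The pattern is illustrative only; regular expressions are brittle against nested or unusual markup.

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$html = '<html><body><h1 id="title"> My <b>App</b> </h1><p>Other text</p></body></html>';

// Modifiers: i = case-insensitive tags, s = let "." match newlines.
// (.*?) is non-greedy, so the match stops at the first closing </h1>.
if (preg_match('/<h1\b[^>]*>(.*?)<\/h1>/is', $html, $m)) {
    // Drop any inline tags inside the heading and trim whitespace.
    $heading = trim(strip_tags($m[1]));
} else {
    $heading = null; // no h1 found
}

echo $heading, PHP_EOL;
```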

floriank
  • You and @Bgi are right, but this is my situation: I have a huge source document and I don't know the DTDs that an XML file needs. Parsing and correcting the whole document would be pointless because I only need some of the HTML content, and a way to retrieve it without parsing a very long string, hence the use of DOM. – esseara Oct 30 '12 at 22:57
0

The best way is to use XPath queries:

http://php.net/manual/en/simplexmlelement.xpath.php

They are very fast.
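
A minimal sketch of an XPath lookup, under the assumption that the page is loaded with DOMDocument::loadHTML(). The linked manual page covers SimpleXMLElement::xpath(); this example swaps in DOMXPath because it works directly on a document parsed from HTML. The markup and the id "title" are made up for illustration.

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$html = '<html><body><div><h1 id="title">My App</h1><h1>Second</h1></div></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// DOMXPath evaluates XPath expressions against the loaded document.
$xpath = new DOMXPath($doc);

$titles = [];
// Select every h1 whose id attribute equals "title".
foreach ($xpath->query('//h1[@id="title"]') as $node) {
    $titles[] = $node->nodeValue;
}

echo implode(', ', $titles), PHP_EOL;
```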

doniyor
  • That was my second thought, but I prefer to use DOM because the source is very long and convoluted, so I'm better off using the tag names :) – esseara Oct 30 '12 at 23:00