I'm developing a Chromium extension, so I have cross-origin XMLHttpRequest permissions for the domains I request permissions for.

I used XMLHttpRequest to fetch an HTML page (text/html). I want to use XPath (document.evaluate) to extract the relevant bits from it. Unfortunately, I'm failing to construct a DOM object from the returned HTML string.

var xhr = new XMLHttpRequest();
var name = escape("Sticks N Stones Cap");
xhr.open("GET", "http://items.jellyneo.net/?go=show_items&name="+name+"&name_type=exact", true);
xhr.onreadystatechange = function () {
    if (xhr.readyState == 4) {
        var parser = new DOMParser();
        var xmlDoc = parser.parseFromString(xhr.responseText, "text/xml");
        console.log(xmlDoc);
    }
};

xhr.send();

console.log prints debug output to the Chromium JS console.

In that console, I get this:

Document
<html>
<body>
<parsererror style="display: block; white-space: pre; border: 2px solid #c77; padding: 0 1em 0 1em; margin: 1em; background-color: #fdd; color: black">
<h3>This page contains the following errors:</h3>
<div style="font-family:monospace;font-size:12px">error on line 1 at column 60: Space required after the Public Identifier
</div>
<h3>Below is a rendering of the page up to the first error.</h3>
</parsererror>
</body>
</html>

So how am I supposed to use XMLHttpRequest -> receive HTML -> convert to DOM -> use XPath to traverse it?

Should I be using the "hidden" iframe hack to load the page and get a DOM object?

Dima
  • I use the iframe technique to load HTML in our web app. It is fast and works well even on IE8. And once you are in the DOM, you can use CSS selectors instead of XPath. – Mic Oct 19 '10 at 21:44
  • @Mic thanks. I'll try to hack that up. It's just that I'm screen-scraping data from a few pages, and XPath is a true wonder =) It lets you get all the similar-looking data from any table, etc. – Dima Oct 19 '10 at 21:48
  • CSS selectors are for HTML what XPath is for XML, but a bit more human :) – Mic Oct 19 '10 at 22:03

1 Answer

The DOMParser is choking on the DOCTYPE declaration, because "text/xml" makes it parse the response as strict XML. It would also fail on any other non-XHTML markup, such as a <link> without a closing /. Do you have control over the document being sent? If not, your best bet is to treat it as a string and use regular expressions to find what you are looking for.
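
As a rough sketch of the string-based approach, assume the page contains cells like <td class="price">...</td>. That structure, the sample string, and the helper name extractCells are made up for illustration and are not taken from the real page:

```javascript
// Hypothetical helper: collect the text of every <td class="..."> cell by
// scanning the raw HTML string. No DOM is built.
function extractCells(html, className) {
    // className is inserted into the pattern as-is, so it must not contain
    // regular-expression metacharacters.
    var pattern = new RegExp(
        '<td[^>]*class="' + className + '"[^>]*>([\\s\\S]*?)</td>', "gi");
    var cells = [];
    var match;
    while ((match = pattern.exec(html)) !== null) {
        // Drop nested tags and trim the surrounding whitespace.
        cells.push(match[1].replace(/<[^>]+>/g, "").trim());
    }
    return cells;
}

// Made-up sample standing in for xhr.responseText.
var sample = '<table><tr><td class="price"> <b>1,200 NP</b> </td>' +
             '<td class="name">Cap</td><td class="price">950 NP</td></tr></table>';
console.log(extractCells(sample, "price"));
```

This only holds up while the markup stays simple and regular; nested tables or reordered attributes would need a different pattern.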

Edit: You can get the browser to parse the contents of the body for you by injecting it into a hidden div:

var hidden = document.body.appendChild(document.createElement("div"));
hidden.style.display = "none";
// Capture everything between <body ...> and </body>, then let the browser parse it.
hidden.innerHTML = /<body[^>]*>([\s\S]+)<\/body>/i.exec(xhr.responseText)[1];

Now search inside hidden to find what you're looking for:

// The HTML parser wraps bare <tr> rows in an implicit <tbody>, so avoid "table > tr".
var myEl = hidden.querySelector("table.foo td.bar > span.fu");
var myVal = myEl.innerHTML;
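
Since the question asks for XPath specifically, the injected markup can probably be queried with document.evaluate as well, using the container element as the context node. A minimal sketch, where xpathTexts is a hypothetical helper and the table markup is invented for the example:

```javascript
// Evaluate an XPath expression relative to `root` and return the trimmed text
// of each matching node. `doc` is the Document that owns `root`.
function xpathTexts(doc, root, expression) {
    // Same numeric value as XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
    var ORDERED_NODE_SNAPSHOT_TYPE = 7;
    var result = doc.evaluate(expression, root, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
    var texts = [];
    for (var i = 0; i < result.snapshotLength; i++) {
        texts.push(result.snapshotItem(i).textContent.trim());
    }
    return texts;
}

// Browser-only usage; the guard keeps the snippet inert outside a page.
if (typeof document !== "undefined" && document.body) {
    var container = document.body.appendChild(document.createElement("div"));
    container.style.display = "none";
    // Invented markup standing in for the extracted <body> contents.
    container.innerHTML =
        '<table class="foo"><tr><td class="bar"><span class="fu">42</span></td></tr></table>';
    console.log(xpathTexts(document, container, ".//td[@class='bar']/span"));
}
```

The leading ".//" keeps the search inside the container, so the rest of the page is not matched.
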
gilly3
  • No, I don't have control over the document being sent. And I'm a bit confused. For the same page I can get a `document` object, yet I can't get one if the page is passed to me as a string? – Dima Oct 19 '10 at 22:10
  • Until it is parsed by the browser, it is just a string. To get the browser to parse it, inject the HTML into a hidden div on the page, then search the div for whatever you are looking for. – gilly3 Oct 19 '10 at 22:23
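
The same idea, letting the browser do the parsing, can also be tried without touching the page: current browsers accept "text/html" as the second argument to DOMParser.parseFromString, which was not widely available when this thread was written. A minimal sketch, where htmlToDocument is a hypothetical helper name and the sample markup is invented:

```javascript
// Parse an HTML string into a separate Document using the HTML parser, which
// tolerates the DOCTYPEs and unclosed tags that break "text/xml" parsing.
// `parser` is optional and injectable; in a browser it defaults to a DOMParser.
function htmlToDocument(html, parser) {
    parser = parser || new DOMParser();
    return parser.parseFromString(html, "text/html");
}

// Browser-only usage; the guard keeps the snippet inert where DOMParser is missing.
if (typeof DOMParser !== "undefined") {
    var parsed = htmlToDocument(
        '<!DOCTYPE html><html><body><link rel="x"><p id="a">hi</p></body></html>');
    // 9 is the numeric value of XPathResult.FIRST_ORDERED_NODE_TYPE.
    var hit = parsed.evaluate("//p[@id='a']", parsed, null, 9, null);
    console.log(hit.singleNodeValue.textContent);
}
```

With this approach the XPath expression runs against the parsed document itself, so nothing has to be injected into the visible page.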