I'm working with a number of malformed HTML pages. At least, I presume they're malformed because when I parse them in Nokogiri and then execute to_html, elements don't appear correctly anymore. When I parse them with Hpricot, however, they display correctly.

I'd rather not use Hpricot because it appears to be impossible to add Hpricot::Elem instances to a document (without converting them to strings, adding, then parsing again).

Can I disable Nokogiri's error correction so that I can preserve the HTML closer to the way it was written?

JellicleCat
  • Great question. I have yet to figure out the differences. Until I do, you can see the original html at (http://pastie.org/2638305) and the `nokogiri.to_html` code at (http://pastie.org/2638308). – JellicleCat Oct 04 '11 at 15:27
  • I have found that Hpricot's 'error correction' messes up one of our sites by enforcing the don't-wrap-block-elements-within-inline-elements rule, i.e. extracting the block element (wrapped) and putting it after the inline element (wrapper). Setting `:fixup_tags` and `:xhtml_strict` to false does not prevent this behaviour. – JellicleCat Oct 04 '11 at 15:33
  • Is your HTML valid XML? If it is then you might be able to `Nokogiri::XML()` it (or the Hpricot equivalent) and the nesting rules wouldn't apply. – mu is too short Oct 04 '11 at 22:13
  • Excellent idea. Alas, this made no difference. (And I ran the page through w3.org's validator with an XHTML doctype to ensure that I had valid XML.) – JellicleCat Oct 05 '11 at 15:50

1 Answer

Your XHTML is not valid XHTML. If I copy the contents from http://pastie.org/2638305, save them as 'foo.xhtml' and then attempt to open them in Chrome, I see:

This page contains the following errors:
error on line 768 at column 39: attributes construct error

If I look on line 768 then I see (truncated):

<img src="..." alt="Talk to us now!"http://wholesaleinsurance.net/>

As you can see, that is clearly not syntactically valid.

You claim that you ran the page through validator.w3.org, but when I do that with the contents of your pastie I get:

Errors found while checking this document as XHTML 1.0 Strict!
Result: 15 Errors, 3 warning(s)

So...is your actual content not what you put in the pastie?

Phrogz
  • You're right about line 768, and I did see errors in w3.org's validator, but when I examined them, they were all (I think) bogus. I disregarded ones that said 'there is no attribute X' and indicated code that didn't appear in the doc. e.g. ` – JellicleCat Oct 20 '11 at 16:39
  • @JellicleCat I suppose you'd have to find them and fix them. These are errors so egregious that Nokogiri (or rather, libxml2) does not know how to handle the soup you're pouring into it. How about a couple `gsub`/regex for the known errors, to make the content suck less? – Phrogz Oct 20 '11 at 16:43
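
As a rough sketch of the `gsub` idea in the last comment: the snippet below targets only the one error quoted from line 768, a stray URL jammed against the closing quote of an `<img>` attribute. The helper name `fix_stray_img_url` and the constant `STRAY_URL_IN_IMG` are made up for illustration, and the regex assumes the stray text always starts with `http://` or `https://` and runs up to the tag's closing `>`. Other errors in the page would need their own patterns.

```ruby
# Hypothetical pattern (not from the original thread): inside an <img> tag,
# match a closing attribute quote that is immediately followed by a bare URL
# running up to the tag's closing ">".
STRAY_URL_IN_IMG = %r{(<img\b[^>]*?")https?://[^\s"<>]*>}

# Hypothetical helper: drops the stray URL and closes the tag as " />".
# Assumes the tag was meant to be self-closing, as <img> is in XHTML.
def fix_stray_img_url(html)
  html.gsub(STRAY_URL_IN_IMG, '\1 />')
end

# The truncated line 768 quoted in the answer above.
broken = '<img src="..." alt="Talk to us now!"http://wholesaleinsurance.net/>'
puts fix_stray_img_url(broken)
# prints: <img src="..." alt="Talk to us now!" />
```

A properly quoted URL such as `src="http://example.com/a.png"` is left alone, because the pattern only fires when the URL sits outside the quotes. The cleaned string could then be handed to `Nokogiri::HTML` as usual.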