How can I get Hpricot to play nice with HTML5?

Question

I am using Hpricot to parse a theme file. However, when I feed a valid HTML5 document into Hpricot(), it auto-closes HTML5 tags (such as <section>) and alters the DOCTYPE.

Are there any extensions to Hpricot, or perhaps a flag I need to set, that will allow HTML5 documents to be parsed correctly?

It also has issues with self-closing img tags. See my post http://stackoverflow.com/questions/4220795 — AntonAL, Nov 19 '10 at 00:40
Could you add a small example of a file you are trying to parse, demonstrating the problem? — philosodad, Jan 07 '11 at 13:56
Is there a reason you need to use Hpricot, as opposed to Nokogiri? The latter is actively developed/maintained and has become a very standard part of the ruby toolkit for these sorts of things. — Bill Dueber, Jan 28 '11 at 02:05

score 2 · Answer 1 · answered Jan 30 '11 at 07:21

I know this sidesteps the direct question, but I would suggest trying Nokogiri (http://nokogiri.org/), as mentioned in some of the comments on your question. I've had no issues with it parsing any HTML/XML-like structured text, including HTML5.

score 0 · Answer 2 · answered Feb 24 '11 at 21:58

I think Hpricot's to_original_html method is exactly what you're looking for.

From the docs for to_original_html:

"Attempts to preserve the original HTML of the document, only outputting new tags for elements which have changed."

nil

