How do I make pQuery work with slightly malformed HTML?

Question

pQuery is a pragmatic port of the jQuery JavaScript framework to Perl which can be used for screen scraping.

pQuery is quite sensitive to malformed HTML. Consider the following example:

use pQuery;

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
my $page = pQuery($html_malformed);
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";

pQuery won't find the title tag in the example above due to the double ">>" in the malformed HTML.

To make my pQuery-based applications more tolerant of malformed HTML, I need to pre-process the HTML by cleaning it up before passing it to pQuery.

Starting with the code fragment given above, what is the most robust pure-Perl way to clean up the HTML so that pQuery can parse it?

score 4 · Accepted Answer · answered Oct 09 '10 at 19:27

I'd report this as a bug in pQuery. Here's a workaround:

use HTML::TreeBuilder;
use pQuery;

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
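# Parse the malformed markup into an HTML::TreeBuilder tree first.
my $html_cleaned = HTML::TreeBuilder->new_from_content($html_malformed);
# Serialize the tree back to HTML and hand that string to pQuery.
my $page = pQuery($html_cleaned->as_HTML);
# Explicitly free the tree once it is no longer needed.
$html_cleaned->delete;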
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";

This doesn't make a lot of sense, since pQuery already uses HTML::TreeBuilder as its underlying parsing mechanism, but it does work.

score 2 · Answer 2 · answered Oct 09 '10 at 15:47

Try HTML::Tidy, which fixes invalid HTML.

lonesomeday

Sorry, but I need a pure-perl solution. It has now been clarified in the question. Thanks for the answer anyways! :-) – knorv Oct 09 '10 at 15:53

score -1 · Answer 3 · answered Oct 09 '10 at 16:00

Is this what you want?

$html_malformed =~ s|<*(<.*?>)>*|$1|g;
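For illustration, here is a minimal sketch of how a substitution along these lines behaves. It assumes the `s|||` substitution operator and `*` quantifiers, so that tags with no stray brackets also match and pass through unchanged. The helper name `strip_stray_brackets` is hypothetical and not part of pQuery or any other module.

```perl
use strict;
use warnings;

# Hypothetical helper: drop extra "<" before a tag and extra ">" after it.
sub strip_stray_brackets {
    my ($html) = @_;
    # Capture one tag, discard any run of "<" before it and ">" after it.
    $html =~ s|<*(<.*?>)>*|$1|g;
    return $html;
}

# The example from the question: the trailing ">>" collapses to ">".
my $fixed = strip_stray_brackets(
    "<html><head><title>foo</title></head><body>bar</body></html>>"
);
print "$fixed\n";

# Doubled brackets around individual tags are collapsed as well.
my $doubled = strip_stray_brackets("<<b>>bold<</b>>");
print "$doubled\n";

# Other kinds of breakage, such as unclosed tags, pass through untouched.
my $unclosed = strip_stray_brackets("<p>one<p>two");
print "$unclosed\n";
```

This only removes repeated angle brackets. It does not repair nesting, unclosed tags, or broken attributes, so it is not a general clean-up.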

Flo Edelmann

No, that would only catch the example given. I'm looking for a more general solution. – knorv Oct 09 '10 at 16:11
