
I am scraping HTML using HtmlUnit, but the HTML is malformed: a few tags are left unclosed, so HtmlUnit gives wrong results. I need to clean the HTML before passing it to HtmlUnit.

How can I do that?

A short code snippet or tutorial would be appreciated.

Naveen

1 Answer


I believe you could do this by implementing your own WebConnectionWrapper. You will then need to find an HTML library that repairs the markup properly (if that is possible for your pages). After that, all you have to do is make sure the wrapper passes the response content through that library, so that the HTML has already been cleaned by the time it reaches HtmlUnit's parser.
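Here is a minimal sketch of that idea, not a tested drop-in solution. It assumes HtmlUnit 2.x package names (`com.gargoylesoftware.htmlunit`; in 3.x they moved to `org.htmlunit`) and uses jsoup as the repair library, which is my choice and not something the question mentions. The class name `CleaningConnection` and the helper `clean` are made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.WebResponseData;
import com.gargoylesoftware.htmlunit.util.NameValuePair;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

/** Hypothetical wrapper that repairs HTML responses before HtmlUnit parses them. */
public class CleaningConnection extends WebConnectionWrapper {

    public CleaningConnection(WebClient client) {
        // The super constructor wraps the client's current connection
        // and installs this wrapper as the client's WebConnection.
        super(client);
    }

    /** Assumption: jsoup does the repair; its parser closes unclosed tags. */
    public static String clean(String html) {
        return Jsoup.parse(html).outerHtml();
    }

    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse original = super.getResponse(request);

        // Leave scripts, images, CSS and so on untouched.
        if (!"text/html".equals(original.getContentType())) {
            return original;
        }

        String fixed = clean(original.getContentAsString());

        // The new body is plain, already-decoded text with a different length,
        // so drop headers that described the original bytes.
        List<NameValuePair> headers = new ArrayList<>();
        for (NameValuePair header : original.getResponseHeaders()) {
            String name = header.getName();
            if (!"Content-Encoding".equalsIgnoreCase(name)
                    && !"Content-Length".equalsIgnoreCase(name)) {
                headers.add(header);
            }
        }

        // Re-encode with the charset the response declares, so HtmlUnit decodes it the same way.
        byte[] body = fixed.getBytes(original.getContentCharset());
        WebResponseData data = new WebResponseData(
                body, original.getStatusCode(), original.getStatusMessage(), headers);
        return new WebResponse(data, request, original.getLoadTime());
    }
}
```

To use it, create the wrapper once right after the client, for example `WebClient client = new WebClient(); new CleaningConnection(client);`, and then call `client.getPage(...)` as usual. Keep in mind that jsoup also normalizes and re-indents the markup, so the cleaned page may differ from the original in whitespace as well as in the repaired tags.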

Mosty Mostacho