Extract and Clean HTML Fragment using HTML Parser (org.htmlparser)

Question

I'm looking for an efficient approach to extracting a fragment of HTML from a web page and performing some specific operations on that HTML fragment.

The operations required are:

Remove all tags that have a class of "hidden"
Remove all script tags
Remove all style tags
Remove all event attributes (on*="*")
Remove all style attributes

I've been using HTML Parser (org.htmlparser) for this task and have been able to meet all of the requirements; however, I don't feel that I have an elegant solution. Currently, I parse the web page with a CssSelectorNodeFilter (to get the fragment) and then re-parse that fragment with a NodeVisitor to carry out the cleaning operations.

Could anybody suggest how they would tackle this problem? I would prefer to only parse the document once and perform all operations during that one parse.

Thanks in advance!

maerics · Accepted Answer · 2011-12-05T15:06:05.873

Check out jsoup - it should handle all of your necessary tasks in an elegant way.

[Edit]

Here's a full working example per your required operations:

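import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Load and parse the document fragment.
// Jsoup.parse(File, ...) throws IOException, so call this from a method that handles it.
File f = new File("myfile.html"); // See also Jsoup#parseBodyFragment(s)
Document doc = Jsoup.parse(f, "UTF-8", "http://example.com");

// Remove all script and style elements and those of class "hidden".
doc.select("script, style, .hidden").remove();

// Remove all style and event-handler attributes from all elements.
Elements all = doc.select("*");
for (Element el : all) {
  // Iterate over a copy (asList) so attributes can be removed safely inside the loop.
  for (Attribute attr : el.attributes().asList()) {
    String attrKey = attr.getKey();
    if (attrKey.equals("style") || attrKey.startsWith("on")) {
      el.removeAttr(attrKey);
    }
  }
}
// See also - doc.select("*").removeAttr("style");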

You'll want to make sure that matching on attribute names is case-insensitive, but this should cover most of what you need.
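As a minimal sketch of the case-sensitivity point, the name check can be pulled into a small helper that lowercases the attribute key before comparing. The class name AttributeFilter and the method shouldStrip are hypothetical names chosen for illustration; they are not part of jsoup, and the helper uses only the standard library.

```java
import java.util.Locale;

/** Hypothetical helper (not part of jsoup): decides which attribute names to strip. */
final class AttributeFilter {

    private AttributeFilter() {}

    /**
     * Returns true for "style" and for any name starting with "on",
     * ignoring case, so "onClick", "ONLOAD" and "Style" all match.
     */
    static boolean shouldStrip(String attrKey) {
        if (attrKey == null) {
            return false;
        }
        // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
        String key = attrKey.trim().toLowerCase(Locale.ROOT);
        return key.equals("style") || key.startsWith("on");
    }
}
```

Inside the attribute loop, the condition would then become AttributeFilter.shouldStrip(attr.getKey()). Like the original check, this treats every attribute whose name begins with "on" as an event handler.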

I will take a look at jsoup. If it provides a better framework for solving my problem, then I shall submit an answer advocating its use for my requirements. Thanks for the tip. - Kieran Hall, Dec 05 '11 at 09:01
