8

I'm looking for an efficient approach to extracting a fragment of HTML from a web page and performing some specific operations on that HTML fragment.

The operations required are:

  1. Remove all tags that have a class of "hidden"
  2. Remove all script tags
  3. Remove all style tags
  4. Remove all event attributes (on*="*")
  5. Remove all style attributes

I've been using HTML Parser (org.htmlparser) for this task and have been able to meet all of the requirements, however, I don't feel that I have an elegant solution. Currently, I am parsing the web page with a CssSelectorNodeFilter (to get the fragment) and then re-parsing that fragment with a NodeVisitor in order to carry out the cleaning operations.

Could anybody suggest how they would tackle this problem? I would prefer to only parse the document once and perform all operations during that one parse.

Thanks in advance!

Kieran Hall
  • 2,637
  • 2
  • 24
  • 27

1 Answers1

10

Check out jsoup - it should handle all of your necessary tasks in an elegant way.

[Edit]

Here's a full working example per your required operations:

// Load and parse the document fragment.
File f = new File("myfile.html"); // See also Jsoup#parseBodyFragment(s)
Document doc = Jsoup.parse(f, "UTF-8", "http://example.com");

// Remove all script and style elements and those of class "hidden".
doc.select("script, style, .hidden").remove();

// Remove all style and event-handler attributes from all elements.
Elements all = doc.select("*");
for (Element el : all) { 
  for (Attribute attr : el.attributes()) { 
    String attrKey = attr.getKey();
    if (attrKey.equals("style") || attrKey.startsWith("on")) { 
      el.removeAttr(attrKey);
    } 
  }
}
// See also - doc.select("*").removeAttr("style");

You'll want to make sure things like case sensitivity don't matter for the attribute names but this should be the majority of what you need.

maerics
  • 151,642
  • 46
  • 269
  • 291