Remove desired tag from html using JTidy

Question

I am using JTidy and XPath to parse HTML. Extracting text is giving me some trouble because the text may contain b tags. Rather than looping over its child nodes, I would like to simply remove the 'b' tags after the HTML is loaded.

How can I delete tags from a DOM document?

Document doc = tidy.parseDOM(url.openStream(), System.out);

For example, in pseudocode: doc.removeTag('<b>');

Is it possible?

Here is a list of configurable options, http://tidy.sourceforge.net/docs/quickref.html, including one that replaces b with strong, but these are only options. Can we override some of them? — Suhrob Samiev, Apr 09 '13 at 08:26

score 0 · Answer 1 · answered Apr 09 '13 at 10:29

You have tagged this with 'jdom', but your document is a DOM document (not JDOM).

Of course, if it were JDOM, you could replace the Elements with their content using a relatively simple document scan. Alternatively, you could use a custom SAXHandler to skip adding the Element in the first place.

Using JDOM, you could, for example, do something like:

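// JDOM 2; needs org.jdom2.Content, org.jdom2.Element, org.jdom2.Parent,
// org.jdom2.filter.Filters, java.util.ArrayList and java.util.List.
// Collect the <b> elements first: the descendant iterator has no add(),
// and editing the tree while walking it is unsafe.
List<Element> bolds = new ArrayList<Element>();
for (Element e : document.getDescendants(Filters.element("b"))) {
  bolds.add(e);
}
for (Element e : bolds) {
  Parent parent = e.getParent();
  int index = parent.indexOf(e);
  // removeContent() detaches and returns all children of <b>
  List<Content> children = e.removeContent();
  parent.removeContent(index);
  // put the children where <b> used to be
  parent.addContent(index, children);
}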
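For the org.w3c.dom.Document that tidy.parseDOM returns, the same "replace the element with its children" idea can be sketched with the standard DOM API alone. This is a minimal sketch, not something tested against JTidy's own DOM classes: the class name TagUnwrapper and the method unwrap are made up for illustration, and the demo builds its Document with the JDK parser rather than JTidy. It assumes the DOM implementation supports insertBefore, removeChild and normalize.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here for illustration only.
public class TagUnwrapper {

    /** Replaces every element named tagName with its own child nodes. */
    public static void unwrap(Document doc, String tagName) {
        // Copy the matches first so the loop does not depend on whether
        // the NodeList is live or a snapshot.
        NodeList found = doc.getElementsByTagName(tagName);
        List<Node> matches = new ArrayList<>();
        for (int i = 0; i < found.getLength(); i++) {
            matches.add(found.item(i));
        }
        for (Node element : matches) {
            Node parent = element.getParentNode();
            // insertBefore moves a node that is already in the tree,
            // so this drains the children in their original order.
            while (element.getFirstChild() != null) {
                parent.insertBefore(element.getFirstChild(), element);
            }
            parent.removeChild(element);
        }
        // Merge text nodes that have become adjacent.
        doc.normalize();
    }

    public static void main(String[] args) throws Exception {
        // Demo input parsed with the JDK parser (an assumption; the
        // question obtains its Document from tidy.parseDOM instead).
        String html = "<p>Some <b>bold <i>and italic</i></b> text</p>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        unwrap(doc, "b");
        // prints "Some bold and italic text"
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

With the question's code, the call would be TagUnwrapper.unwrap(doc, "b") right after tidy.parseDOM(...). Nested b elements are handled because the outer one is processed first and the inner one is simply moved up to the outer element's parent before its own turn comes.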
