Jsoup clean title tag failure

Question

I am using Jsoup 1.9.2 to strip specific tags from some XML input. While doing this, I noticed that Jsoup behaves strangely when asked to clean title tags. Specifically, other XML tags nested inside the title tag are not removed; instead, they are replaced by their escaped forms.

I created a short unit test for this, shown below. The test fails because output comes out as CuCl&lt;sub&gt;2&lt;/sub&gt;.

@Test
public void stripXmlSubInTitle() {
    final String input = "<title>CuCl<sub>2</sub></title>";
    final String output = Jsoup.clean(input, Whitelist.none());
    assertEquals("CuCl2", output);
}

If the title tag is replaced with other tags (e.g., p or div), then everything works as expected. Any explanation and workaround will be appreciated.

http://stackoverflow.com/questions/8683018/jsoup-clean-without-adding-html-entities — maztt, May 31 '16 at 11:49

flavio.donze · Answer 1 · 2016-05-31T13:06:01.700

The title tag should be used within the head (or in HTML5 within the html) tag. Since it is used to display the title of the HTML document, mostly in a browser window/tab, it is not supposed to have child tags.

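Jsoup therefore treats it differently from actual content tags like p or div: the HTML parser reads everything between the opening and closing title tags as plain text, so the nested sub tag never becomes an element and is escaped on output. The same applies to textarea.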
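
To see the difference, you can parse the same input once with Jsoup's default HTML parser and once with its XML parser. This is only a minimal sketch, assuming Jsoup is on the classpath; the class and method names (TitleParsingDemo, parseAsHtml, parseAsXml, subCount, titleText) are made up for illustration. As far as I can tell, Jsoup.clean itself goes through the HTML parser, so this shows the parsing behavior rather than a drop-in replacement for the Whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class TitleParsingDemo {

    // Parse with the default HTML parser, which reads title content as text.
    static Document parseAsHtml(String input) {
        return Jsoup.parse(input);
    }

    // Parse with the XML parser, which applies no HTML-specific tag rules.
    static Document parseAsXml(String input) {
        return Jsoup.parse(input, "", Parser.xmlParser());
    }

    // Number of sub elements found inside title elements.
    static int subCount(Document doc) {
        return doc.select("title sub").size();
    }

    // Text of the first title element in the document.
    static String titleText(Document doc) {
        return doc.select("title").first().text();
    }

    public static void main(String[] args) {
        String input = "<title>CuCl<sub>2</sub></title>";
        Document html = parseAsHtml(input);
        Document xml = parseAsXml(input);
        // HTML parser: no sub element, the markup is part of the title text.
        System.out.println(subCount(html) + " / " + titleText(html));
        // XML parser: sub is a real child element of title.
        System.out.println(subCount(xml) + " / " + titleText(xml));
    }
}
```

With the XML parser the sub tag is an ordinary element, so it can be selected and removed or unwrapped like any other tag.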

Edit:

You could do something like this:

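import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;

public class CleanTitles {
    public static void main(String[] args) {
        final String input = "<content><title>CuCl<sub>2</sub></title><othertag>blabla</othertag><title>title with no subtags</title></content>";
        Document document = Jsoup.parse(input);
        Elements titles = document.getElementsByTag("title");
        for (Element element : titles) {
            // The parser kept the title's inner markup as plain text, so
            // cleaning that text strips the nested tags.
            element.text(Jsoup.clean(element.ownText(), Whitelist.none()));
        }
        System.out.println(document.body().toString());
    }
}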

That would return:

<body>
 <content>
  <title>CuCl2</title>
  <othertag>
   blabla
  </othertag>
  <title>title with no subtags</title>
 </content>
</body>

Depending on your needs, you may have to adjust this. For example, to strip the remaining tags as well:

System.out.println(Jsoup.clean(document.body().toString(), Whitelist.none()));

That would return:

CuCl2  blabla  title with no subtags

Thanks! My documents are not pure HTML though, they're XML with some (incidentally) HTML tags. Can you recommend a way to avoid this (other than a regex replacement)? I like/need the Whitelist bit of Jsoup. — Claudiu, May 31 '16 at 11:48
