
I am using Jsoup 1.9.2 to process and clean some XML input of specific tags. During this, I noticed that Jsoup behaves strangely when it is asked to clean title tags. Specifically, other XML tags within the title tag do not get removed, and in fact get replaced by their escaped forms.

I created a short unit test for this, shown below. The test fails because output has the value CuCl&lt;sub&gt;2&lt;/sub&gt;.

@Test
public void stripXmlSubInTitle() {
    final String input = "<title>CuCl<sub>2</sub></title>";
    final String output = Jsoup.clean(input, Whitelist.none());
    assertEquals("CuCl2", output);
}

If the title tag is replaced with other tags (e.g., p or div), then everything works as expected. Any explanation and workaround will be appreciated.

Claudiu

1 Answer


The title tag should be used within the head (or in HTML5 within the html) tag. Since it is used to display the title of the HTML document, mostly in a browser window/tab, it is not supposed to have child tags.

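Jsoup therefore treats it differently from content tags such as p or div: its HTML parser reads the content of title as plain text, so the nested <sub> is never parsed as an element and ends up escaped in the output. The same applies to textarea.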
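To see this behaviour directly, here is a minimal sketch that parses the same input once with Jsoup's default HTML parser and once with its XML parser (Parser.xmlParser()). The class and method names (TitleParsingDemo, firstTitle, childElementCount, titleText) are made up for illustration, and it assumes Jsoup is on the classpath. If your input is really XML, the XML parser may be the better fit, though I have not checked how it combines with Whitelist-based cleaning.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class TitleParsingDemo {

    // Parses the markup with the XML or the HTML parser and returns the first <title>.
    public static Element firstTitle(String markup, boolean asXml) {
        Document doc = asXml
                ? Jsoup.parse(markup, "", Parser.xmlParser())
                : Jsoup.parse(markup);
        return doc.getElementsByTag("title").first();
    }

    // Number of child elements that the parser created inside <title>.
    public static int childElementCount(String markup, boolean asXml) {
        return firstTitle(markup, asXml).children().size();
    }

    // Combined text of <title> and everything nested in it.
    public static String titleText(String markup, boolean asXml) {
        return firstTitle(markup, asXml).text();
    }

    public static void main(String[] args) {
        String input = "<title>CuCl<sub>2</sub></title>";

        // HTML parser: the content of <title> is a single text node.
        System.out.println(childElementCount(input, false)); // 0
        System.out.println(titleText(input, false));         // CuCl<sub>2</sub>

        // XML parser: <sub> becomes a real child element.
        System.out.println(childElementCount(input, true));  // 1
        System.out.println(titleText(input, true));          // CuCl2
    }
}
```

With the HTML parser the title has no child elements and its text still contains the literal <sub> markup, which is why it comes back escaped after cleaning. With the XML parser title is an ordinary element, so text() already returns the tag-free string.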

Edit:

You could do something like this:

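import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;

public class TitleCleaner {
    public static void main(String[] args) {
        final String input = "<content><title>CuCl<sub>2</sub></title><othertag>blabla</othertag><title>title with no subtags</title></content>";
        Document document = Jsoup.parse(input);
        Elements titles = document.getElementsByTag("title");
        for (Element element : titles) {
            // The title content was parsed as plain text, so clean that text separately
            // and write the tag-free result back into the element.
            element.text(Jsoup.clean(element.ownText(), Whitelist.none()));
        }
        System.out.println(document.body().toString());
    }
}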

That would return:

<body>
 <content>
  <title>CuCl2</title>
  <othertag>
   blabla
  </othertag>
  <title>title with no subtags</title>
 </content>
</body>

Depending on your needs, you may have to adjust this, e.g. to get the plain text only:

System.out.println(Jsoup.clean(document.body().toString(), Whitelist.none()));

That would return:

CuCl2  blabla  title with no subtags
flavio.donze
  • Thanks! My documents are not pure HTML though, they're XML with some (incidentally) HTML tags. Can you recommend a way to avoid this (other than a regex replacement)? I like/need the Whitelist bit of Jsoup. – Claudiu May 31 '16 at 11:48