15

I'm looking for an HTML sanitizer that I can call through an API to sanitize strings I receive from my web app. Are there any useful, easy-to-use libraries available? Does anyone know of one or two?

I don't need anything big; it just has to find unclosed tags and close them.
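
To make the requirement concrete, here is a rough sketch of the "find unclosed tags and close them" behaviour using only the Java standard library. The class and method names (`TagCloser`, `closeUnclosedTags`) are made up for illustration, and the regex-based scan is an assumption that only holds for simple markup. It only appends closers for tags still open at the end of the input and does not protect against XSS, so it is not a replacement for a real sanitizer.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
class TagCloser {
    // Void elements never take a closing tag, so they are not tracked.
    private static final Set<String> VOID = Set.of(
        "area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr");

    // Group 1: "/" for a closing tag, group 2: tag name, group 3: "/" for <x/>.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>");

    static String closeUnclosedTags(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (VOID.contains(name) || selfClosing) {
                continue;
            }
            if (!closing) {
                open.push(name);
            } else if (open.contains(name)) {
                // Pop up to and including the matching open tag; inner tags
                // left open are treated as implicitly closed here.
                while (!open.pop().equals(name)) {
                    // keep popping
                }
            }
            // A closing tag with no matching open tag is ignored.
        }
        // Append closers for whatever is still open, innermost first.
        StringBuilder out = new StringBuilder(html);
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

For example, `TagCloser.closeUnclosedTags("<p>Hello <b>world")` returns `<p>Hello <b>world</b></p>`, while already balanced input is returned unchanged.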

onigunn

5 Answers

25

https://github.com/OWASP/java-html-sanitizer is now marked ready for production use.

A fast and easy to configure HTML Sanitizer written in Java which lets you include HTML authored by third-parties in your web application while protecting against XSS.

You can use prepackaged policies

Sanitizers.FORMATTING.and(Sanitizers.LINKS)

or configure your own, as the tests show:

PolicyFactory policy = new HtmlPolicyBuilder()
    .allowElements("a")
    .allowUrlProtocols("https")
    .allowAttributes("href").onElements("a")
    .requireRelNofollowOnLinks()
    .toFactory();

or write custom policies to do things like changing h1s to divs with a certain class:

PolicyFactory policy = new HtmlPolicyBuilder()
    .allowElements("h1", "p")
    .allowElements(
        new ElementPolicy() {
          // Rewrite each matched element as a div with a "header-<name>" class.
          public String apply(String elementName, List<String> attrs) {
            attrs.add("class");
            attrs.add("header-" + elementName);
            return "div";
          }
        }, "h1")
    .toFactory();
Mike Samuel
  • This library makes a good first impression: Well documented and a clean API. – Sven Jacobs Jun 28 '13 at 05:21
  • I use this library, but it removes embedded iframes as well. Is there any way to allow iframes? I have genuine use cases, like embedding a YouTube video or a SlideShare presentation. How could I allow such embedded iframes? – Rajat Gupta Aug 16 '14 at 16:09
  • 1
    @usero1, Yes, you can `allowElements("iframe")`. – Mike Samuel Aug 16 '14 at 20:39
  • Thanks so much, Mike! But the input HTML is like: *` `* and in the sanitized output I get all the attributes stripped out. How do I prevent the attributes from being stripped out? By the way, is it safe to allow iframes as such with all these attributes? – Rajat Gupta Aug 17 '14 at 11:13
  • The sanitized output I get for the above input is: *` `* which is of no use. Thanks! – Rajat Gupta Aug 17 '14 at 11:56
  • 1
    @user01, You probably have to allow the `src` attribute with a value that you approve. See the documentation for the HTML policy builder class. – Mike Samuel Aug 19 '14 at 13:50
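
As a rough illustration of what "a value that you approve" could mean for an iframe `src`, the check below accepts only `https` URLs whose host is on a small allowlist. It uses only the Java standard library and is independent of the sanitizer. The class name, the method name, and the host list are assumptions made up for this sketch, and how such a rule is attached to a policy is something to confirm in the HTML policy builder documentation.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of the sanitizer library.
class IframeSrcCheck {
    // Assumed allowlist of embed hosts; adjust to the hosts you actually trust.
    private static final Set<String> ALLOWED_HOSTS =
        Set.of("www.youtube.com", "www.slideshare.net");

    static boolean isApprovedSrc(String src) {
        if (src == null) {
            return false;
        }
        try {
            URI uri = new URI(src.trim());
            String scheme = uri.getScheme();
            String host = uri.getHost();
            // Reject relative, scheme-less, and opaque URLs such as javascript:...
            if (scheme == null || host == null) {
                return false;
            }
            return "https".equalsIgnoreCase(scheme)
                && ALLOWED_HOSTS.contains(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Anything that does not parse as a URI is rejected.
            return false;
        }
    }
}
```

With this sketch, `https://www.youtube.com/embed/abc` is accepted, while `http://www.youtube.com/embed/abc`, `javascript:alert(1)`, and URLs on other hosts are rejected. Allowing every attribute on every iframe would be much less safe than approving specific values like this.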
10

JTidy may help you.

Jerome
3

The HTML Parser JSoup also supports sanitisation by policy: http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer

eckes
2

Apart from JTidy, you can also take a look at:
NekoHTML
TagSoup
Getting text in an HTML document

Samuh