29

I'm cleaning some text from unwanted HTML tags (such as <script>) by using

String clean = Jsoup.clean(someInput, Whitelist.basicWithImages());

The problem is that it replaces for instance å with &aring; (which causes troubles for me since it's not "pure xml").

For example

Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages())

yields

"hello &aring;  world"

but I would like

"hello å  world"

Is there a simple way to achieve this? (I.e. simpler than converting &aring; back to å in the result.)
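
To make the "not pure xml" problem concrete: XML predefines only five named entities (&amp;lt;, &amp;gt;, &amp;amp;, &amp;quot; and &amp;apos;), so a standard XML parser rejects &amp;aring; unless a DTD declares it. The following is a minimal sketch using the JDK's built-in parser. The class and method names are made up for illustration and have nothing to do with Jsoup.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Jsoup.
final class XmlEntityCheck {

    // Returns true if the fragment parses as XML once wrapped in a root element.
    static boolean isWellFormedXml(String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
            return true;
        } catch (SAXException e) {
            return false; // e.g. a reference to an undeclared entity
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml("hello \u00e5  world"));   // literal å
        System.out.println(isWellFormedXml("hello &aring;  world")); // HTML-only entity
    }
}
```

The first call should report a well-formed fragment and the second should not, which is why the cleaned output cannot be dropped into an XML document as-is.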

Charles
aioobe

7 Answers

37

You can configure Jsoup's escaping mode: using EscapeMode.xhtml will give you output without named HTML entities such as &aring; (only the basic XML ones, like &amp; and &lt;, are still escaped).

Here's a complete snippet that accepts str as input, and cleans it using Whitelist.simpleText():

// Parse str into a Document
Document doc = Jsoup.parse(str);

// Clean the document.
doc = new Cleaner(Whitelist.simpleText()).clean(doc);

// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string of the body.
str = doc.body().html();
aioobe
bmoc
  • Note: Rather than interacting directly with a [Cleaner object](https://jsoup.org/apidocs/org/jsoup/safety/Cleaner.html), use the [clean methods](https://jsoup.org/apidocs/org/jsoup/Jsoup.html#clean-java.lang.String-java.lang.String-org.jsoup.safety.Whitelist-) in Jsoup. – Dave Jarvis Jul 02 '16 at 18:43
11

There are already feature requests for this on the Jsoup website. You can extend the source code yourself by adding a new empty Map and a new escaping type. If you don't want to do that, you can use StringEscapeUtils from Apache Commons.

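public static String getTextOnlyFromHtmlText(String htmlText) {
    // Parse the input and emit non-ASCII characters as UTF-8
    Document doc = Jsoup.parse(htmlText);
    doc.outputSettings().charset("UTF-8");
    // Strip everything except simple text formatting tags
    htmlText = Jsoup.clean(doc.body().html(), Whitelist.simpleText());
    // Convert entities such as &aring; back to characters
    // (commons-lang 2.x name; in commons-lang3 the method is unescapeHtml4)
    htmlText = StringEscapeUtils.unescapeHtml(htmlText);
    return htmlText;
}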
Frank Szilinski
  • 1
    good point with the StringEscapeUtils method Frank. Very useful, not only in this case – frandevel Dec 17 '13 at 09:32
  • 3
    @frandevel This would be a very bad idea. If the input is `<script>alert('Hello');</script>`, you will actually inject unsafe HTML and allow XSS attack. – Guillaume Polet Feb 20 '15 at 10:35
  • This functionality is now implemented in Jsoup. See Parser.unescapeEntities, https://jsoup.org/apidocs/org/jsoup/parser/Parser.html – Andreas Lundgren Nov 22 '17 at 11:04
  • @GuillaumePolet Interesting point. How do you clean inputs like this, then: `<script>alert('Hello');</script>`? – bsingh Feb 26 '19 at 06:37
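
The risk raised in the comments shows up when the input contains markup that is already escaped, such as &amp;lt;script&amp;gt;. A cleaner treats that as harmless text and leaves it alone, and a later unescape step turns it back into a live tag. The sketch below illustrates only the unescape step; `unescapeBasic` is a hypothetical stand-in for StringEscapeUtils.unescapeHtml that handles just four entities, and the "cleaned" string is assumed rather than produced by Jsoup.

```java
// Hypothetical illustration; unescapeBasic is NOT a Jsoup or Commons Lang method.
final class UnescapeRisk {

    // Minimal stand-in for a full HTML unescaper: handles four entities only.
    // &amp; is replaced last so that "&amp;lt;" is not unescaped twice.
    static String unescapeBasic(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Assumed cleaner output: escaped markup passes through as plain text.
        String cleaned = "&lt;script&gt;alert('Hello');&lt;/script&gt;";
        System.out.println(unescapeBasic(cleaned));
    }
}
```

If the result is then written into a page without escaping it again, the script tag is back. Setting the escape mode on the output, as in the other answers, avoids the separate unescape step.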
5

The answer from @bmoc works fine, but you could use a shorter solution:

// Clean the HTML and set the escape mode in a single call
String clean = Jsoup.clean(someInput, "yourBaseUriOrEmpty", Whitelist.simpleText(), new Document.OutputSettings().escapeMode(EscapeMode.xhtml));
ersefuril
2

A simpler way to do this is:

// clean the html
String output = Jsoup.clean(html, Whitelist.basicWithImages());

// Parse string into a document
Document doc = Jsoup.parse(output);

// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string
System.out.println(doc.body().html());

I have tested this and it works.

Girish
2

The accepted answer uses Jsoup.parse, which, after a quick glance at the source, seems more heavyweight than what Jsoup.clean does.

I copied the source code of Jsoup.clean(...) and added the line that sets the escape mode. This should avoid some unnecessary steps done by the parse method, because it only has to handle a fragment rather than parse a whole HTML document.

private String clean(String html, Whitelist whitelist) {
    Document dirty = Jsoup.parseBodyFragment(html, "");
    Cleaner cleaner = new Cleaner(whitelist);
    Document clean = cleaner.clean(dirty);
    clean.outputSettings().escapeMode(EscapeMode.xhtml);
    return clean.body().html();
}
kapex
1

Simple way:

EscapeMode em = EscapeMode.xhtml;
em.getMap().clear();

doc.outputSettings().escapeMode(em);

This will remove ALL HTML entities, including these: ', ", &, < and >. By default, EscapeMode.xhtml keeps these entities.
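
For context on why those few entities usually stay escaped: a raw & or < inside text is not valid XML, whereas å is fine as a literal character. The sketch below shows that idea in plain Java. It is a hypothetical stand-in, not Jsoup's implementation, and `escapeXmlText` is a made-up name.

```java
// Hypothetical illustration of "escape only what XML requires"; not Jsoup code.
final class XmlTextEscaper {

    // Escapes the characters that XML text content cannot carry raw (& and <),
    // plus > for symmetry, and leaves every other character untouched.
    static String escapeXmlText(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e5 is the letter å; it passes through unchanged.
        System.out.println(escapeXmlText("\u00e5 < b & c"));
    }
}
```

Clearing the map entirely gives up this protection, so it is only reasonable when the output is never interpreted as markup.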

0

Parse the HTML as a Document, then use a Cleaner to clean the document and generate another one. Get the outputSettings of the cleaned document and set the appropriate charset and the escape mode to xhtml, then transform the document to a String. Not tested, but should work.

JB Nizet