Jsoup.clean without adding html entities

Question

I'm cleaning some text from unwanted HTML tags (such as <script>) by using

String clean = Jsoup.clean(someInput, Whitelist.basicWithImages());

The problem is that it replaces, for instance, å with &aring; (which causes trouble for me since it's not "pure XML").

For example

Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages())

yields

"hello &aring;  world"

but I would like

"hello å  world"

Is there a simple way to achieve this? (I.e. simpler than converting &aring; back to å in the result.)

score 37 · Accepted Answer · edited May 11 '12 at 13:14


You can configure Jsoup's escaping mode: using EscapeMode.xhtml will give you output without named HTML entities such as &aring;. Only the characters that are reserved in XML are still escaped.

Here's a complete snippet that accepts str as input, and cleans it using Whitelist.simpleText():

// Parse str into a Document
Document doc = Jsoup.parse(str);

// Clean the document.
doc = new Cleaner(Whitelist.simpleText()).clean(doc);

// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string of the body.
str = doc.body().html();

edited May 11 '12 at 13:14

aioobe


answered May 11 '12 at 12:49

bmoc


Note: Rather than interacting directly with a [Cleaner object](https://jsoup.org/apidocs/org/jsoup/safety/Cleaner.html), use the [clean methods](https://jsoup.org/apidocs/org/jsoup/Jsoup.html#clean-java.lang.String-java.lang.String-org.jsoup.safety.Whitelist-) in Jsoup. – Dave Jarvis Jul 02 '16 at 18:43

score 11 · Answer 2 · answered Feb 16 '12 at 15:08


There are already feature requests for this on the Jsoup website. You can extend the source code yourself by adding a new empty Map and a new escaping type. If you don't want to do that, you can use StringEscapeUtils from Apache Commons Lang.

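public static String getTextOnlyFromHtmlText(String htmlText) {
    Document doc = Jsoup.parse(htmlText);
    doc.outputSettings().charset("UTF-8");
    // Strip everything except the simple text formatting tags.
    String cleaned = Jsoup.clean(doc.body().html(), Whitelist.simpleText());
    // Turn entities such as &aring; back into characters.
    // unescapeHtml is the commons-lang 2.x name; commons-lang3 calls it unescapeHtml4.
    // Caution: this also turns escaped markup such as &lt;script&gt; back into live tags.
    return StringEscapeUtils.unescapeHtml(cleaned);
}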

answered Feb 16 '12 at 15:08

Frank Szilinski



good point with the StringEscapeUtils method Frank. Very useful, not only in this case – frandevel Dec 17 '13 at 09:32

@frandevel This would be a very bad idea. If the input is `&lt;script&gt;alert('Hello');&lt;/script&gt;`, you will actually inject unsafe HTML and allow an XSS attack. – Guillaume Polet Feb 20 '15 at 10:35
This functionality is now implemented in Jsoup. See Parser.unescapeEntities, https://jsoup.org/apidocs/org/jsoup/parser/Parser.html – Andreas Lundgren Nov 22 '17 at 11:04
@GuillaumePolet Interesting point. Then how do you clean inputs like `&lt;script&gt;alert('Hello');&lt;/script&gt;`? – bsingh Feb 26 '19 at 06:37
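
The risk raised in the comments can be reproduced without Jsoup. The sketch below assumes the user typed the script tag as entity-escaped text, so a sanitizer would pass it through as harmless text. The class name UnescapeRiskDemo and the helper unescapeBasic are hypothetical stand-ins for a general-purpose unescaper such as StringEscapeUtils, and the helper knows only a handful of entities.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UnescapeRiskDemo {

    // Hypothetical stand-in for an HTML unescaper; it handles only a few entities.
    static String unescapeBasic(String s) {
        Map<String, String> entities = new LinkedHashMap<>();
        entities.put("&lt;", "<");
        entities.put("&gt;", ">");
        entities.put("&quot;", "\"");
        entities.put("&aring;", "\u00e5");
        entities.put("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
        String out = s;
        for (Map.Entry<String, String> e : entities.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sanitizer output: the tag was submitted as escaped text, so it is still escaped.
        String cleaned = "hello &aring; &lt;script&gt;alert('Hello');&lt;/script&gt;";
        // After unescaping, the text contains a real script tag again.
        System.out.println(unescapeBasic(cleaned));
    }
}
```

If the unescaped string is later written into a page as HTML, the script tag is live. That is why changing the escape mode before serializing is the safer route than unescaping afterwards.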

score 5 · Answer 3 · answered Mar 24 '17 at 08:45


The answer from @bmoc works fine, but you could use a shorter solution:

// Clean html
String clean = Jsoup.clean(someInput, "yourBaseUriOrEmpty", Whitelist.simpleText(), new Document.OutputSettings().escapeMode(EscapeMode.xhtml));

answered Mar 24 '17 at 08:45

ersefuril


score 2 · Answer 4 · answered Jan 06 '13 at 06:47

A simpler way to do this is

// clean the html
String output = Jsoup.clean(html, Whitelist.basicWithImages());

// Parse string into a document
Document doc = Jsoup.parse(output);

// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string
System.out.println(doc.body().html());

I have tested this and it works.

kapex · Answer 5 · 2014-02-03T12:28:31.667

From a quick glance at the source, the Jsoup.parse call used in the accepted answer seems more heavyweight than what Jsoup.clean does.

I copied the source code of Jsoup.clean(...) and added the line that sets the escape mode. This should avoid some unnecessary steps done by the parse method, because it doesn't have to parse a whole HTML document but only handles a fragment.

private String clean(String html, Whitelist whitelist) {
    Document dirty = Jsoup.parseBodyFragment(html, "");
    Cleaner cleaner = new Cleaner(whitelist);
    Document clean = cleaner.clean(dirty);
    clean.outputSettings().escapeMode(EscapeMode.xhtml);
    return clean.body().html();
}

Diego Queres · Answer 6 · 2015-06-29T20:28:44.847

1

Simple way:

EscapeMode em = EscapeMode.xhtml;
em.getMap().clear();

doc.outputSettings().escapeMode(em);

This removes ALL HTML entities from the output, including those for ', ", &, < and >. Plain EscapeMode.xhtml still emits entities for these characters.
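
For context on why those five are usually left escaped: they are the characters reserved in XML, so output that contains them raw is no longer "pure XML". The sketch below is a hypothetical helper (XmlEscapeSketch and escapeXmlReserved are illustration names, not Jsoup's implementation) that escapes only those five and passes everything else, including å, through untouched.

```java
public class XmlEscapeSketch {

    // Escapes only the characters that are reserved in XML; everything else passes through.
    static String escapeXmlReserved(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The non-ASCII letter stays as it is; only the reserved characters change.
        System.out.println(escapeXmlReserved("hello \u00e5 <b> & \"world\""));
    }
}
```

Clearing the escape map drops even this minimal escaping, so it is probably only appropriate when the result is treated as plain text and never re-inserted into markup.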

edited Jun 29 '15 at 20:28

answered Jun 26 '15 at 21:17

Diego Queres


score 0 · Answer 7 · answered Dec 30 '11 at 19:20

Parse the HTML as a Document, then use a Cleaner to clean the document and generate another one. Get the outputSettings of the cleaned document, set the appropriate charset and set the escape mode to xhtml. Finally, transform the document to a String. Not tested, but should work.
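
On the charset part of this answer: the assumption here, worth checking against the Jsoup source, is that a serializer has to fall back to a numeric character reference for any character the chosen output charset cannot encode. The sketch below illustrates that idea with the standard CharsetEncoder only. CharsetFallbackSketch and encodeOrReference are hypothetical names, and this is not Jsoup's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CharsetFallbackSketch {

    // Keeps characters the charset can encode and writes the rest as numeric references.
    static String encodeOrReference(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "hello \u00e5";
        // US-ASCII cannot represent the letter, so it becomes a numeric reference.
        System.out.println(encodeOrReference(text, StandardCharsets.US_ASCII));
        // UTF-8 can represent it, so the text is unchanged.
        System.out.println(encodeOrReference(text, StandardCharsets.UTF_8));
    }
}
```

Under that assumption, picking UTF-8 as the output charset is what lets å survive as a literal character once named entities are switched off.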
