Convert HTML Character Back to Text Using Java Standard Library

Question

I would like to convert HTML character entities (such as &amp;) back to plain text using the Java standard library. Is there a library that does this?

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class Main {
    public static void main(String[] args) {
        // "Happy & Sad" in HTML form.
        String s = "Happy &amp; Sad";
        System.out.println(s);

        try {
            // Intended to change s to "Happy & Sad". DOESN'T WORK!
            s = URLDecoder.decode(s, "UTF-8");
            System.out.println(s);
        } catch (UnsupportedEncodingException ex) {
            // Not expected: every Java platform supports UTF-8.
        }
    }
}

score 60 · Accepted Answer · edited Aug 30 '19 at 07:50

I think the StringEscapeUtils.unescapeHtml3() and unescapeHtml4() methods are what you are looking for. They were introduced in Apache Commons Lang 3 and are now maintained in Apache Commons Text (org.apache.commons.text.StringEscapeUtils). See https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.

edited Aug 30 '19 at 07:50 by jsheeran
answered Mar 01 '09 at 11:46 by Bill.D

Up to date url: http://commons.apache.org/lang/api-2.6/org/apache/commons/lang/StringEscapeUtils.html – Reu Nov 23 '11 at 16:57
Not to beat a dead horse, but what the OP was asking for was how to translate between HTML entities and "plain" text (which is ASCII for me, but YMMV). The Jakarta lib above has unescapeHtml (and escapeHtml), which does the trick. URLDecoder still works for percent-encoded URL strings (like GET parameters). – jjohn Jun 14 '12 at 18:20
Is the same supported on Android, any idea? – CoDe Sep 13 '13 at 19:49
Better to give the main url, the specific versions can be deleted ;) => http://commons.apache.org/proper/commons-lang/ – Seynorth Dec 17 '13 at 15:23
StringEscapeUtils is deprecated. The reply just below is now the most correct. – allemattio Aug 11 '17 at 12:46
No offense, I'm a fervent Apache Commons supporter and an Apache Software Foundation member, but I agree with allemattio. From experience I'd rather use jsoup as suggested below. – JacquesLeRoux Nov 27 '17 at 13:20
Please add the Gradle dependency as well, because I almost couldn't find it: implementation 'org.apache.commons:commons-text:1.0' – SonickSeven Jun 25 '20 at 02:53

score 28 · Answer 2 · edited Sep 27 '12 at 11:14

Just add the jsoup JAR to your application's classpath and then use this code.

import org.jsoup.Jsoup;

public class Encoder {
    public static void main(String[] args) {
        // Parse the fragment as HTML and extract its text, which decodes the entities.
        String s = Jsoup.parse("&lt;Fran&ccedil;ais&gt;").text();
        System.out.print(s);
    }
}

Link to download jsoup: http://jsoup.org/download

edited Sep 27 '12 at 11:14 by nhahtdh
answered Sep 27 '12 at 04:52 by jem

This should be the accepted answer. No other library is faster nor easier to import than the amazing Jsoup. – Grux May 13 '15 at 08:55
Awesome. This is the answer. – Sattar Hummatli Aug 17 '16 at 12:24

score 7 · Answer 3 · answered Mar 01 '09 at 11:29

java.net.URLDecoder deals only with the application/x-www-form-urlencoded MIME format (e.g. "%20" represents space), not with HTML character entities. I don't think there's anything on the Java platform for that. You could write your own utility class to do the conversion, like this one.
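
For a rough idea of what such a utility class could look like, here is a minimal sketch that uses only the standard library. The class name HtmlEntityDecoder and its small entity table are hypothetical (this is not the class linked above). It assumes Java 9 or later and only covers numeric references plus a few named entities; the full HTML list is far longer.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the JDK: decodes a small subset of HTML entities. */
final class HtmlEntityDecoder {

    // Only a handful of named entities, chosen for illustration.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "quot", "\"",
            "apos", "'",
            "nbsp", "\u00A0");

    // Matches &name;, decimal &#205; and hexadecimal &#xCD; references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    public static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = decode(m.group(1));
            // Unknown or invalid references are copied through unchanged.
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference is not recognised.
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("Happy &amp; Sad"));
        System.out.println(unescape("GU&#205;A TELEF&#211;NICA"));
    }
}
```

The input is scanned in a single pass, so "&amp;lt;" becomes "&lt;" rather than "<". For anything beyond simple cases, a maintained library is probably the safer choice.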

score 5 · Answer 4 · answered Mar 01 '09 at 11:37

URLDecoder should only be used to decode strings in the "application/x-www-form-urlencoded" MIME format, such as the query strings generated by HTML forms. It does not handle HTML character entities.
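
A small illustration of the difference, assuming Java 10 or later for the Charset overload of URLDecoder.decode. The class name UrlDecoderDemo and the helper decodeForm are made up for this example.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class: shows what URLDecoder does and does not decode. */
final class UrlDecoderDemo {

    public static String decodeForm(String s) {
        // The Charset overload avoids the checked UnsupportedEncodingException.
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Form encoding: '+' stands for a space and %26 for '&'.
        System.out.println(decodeForm("Happy+%26+Sad"));   // Happy & Sad
        // An HTML entity contains no '+' or '%', so it comes back unchanged.
        System.out.println(decodeForm("Happy &amp; Sad")); // Happy &amp; Sad
    }
}
```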

After a search I found a Translate class within the HTML Parser library.

answered Mar 01 '09 at 11:37 by Rich

very good library, now it's easy to do something like – Miguel Aug 17 '12 at 14:58

score 4 · Answer 5 · edited Dec 12 '17 at 11:57

You can use the class org.apache.commons.lang.StringEscapeUtils:

String s = StringEscapeUtils.unescapeHtml("Happy &amp; Sad");

This works.

edited Dec 12 '17 at 11:57 by pirho
answered Dec 12 '17 at 11:37 by Bruno Barros

I prefer this solution. When possible I suggest using Apache libs. (my opinion) – Andrea Girardi Jan 12 '18 at 14:09

score 2 · Answer 6 · answered Apr 07 '18 at 00:02

Or you can use unescapeHtml4:

    String miCadena="GU&#205;A TELEF&#211;NICA";
    System.out.println(StringEscapeUtils.unescapeHtml4(miCadena));

This code prints: GUÍA TELEFÓNICA

answered Apr 07 '18 at 00:02 by Heriberto Gutiérrez Gutiérrez

score 2 · Answer 7 · answered Mar 01 '09 at 11:15

I'm not aware of any way to do it using the standard library. But I do know and use this class that deals with html entities.

"HTMLEntities is an Open Source Java class that contains a collection of static methods (htmlentities, unhtmlentities, ...) to convert special and extended characters into HTML entities and vice versa."

http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=htmlentities

score 1 · Answer 8 · edited Mar 10 '17 at 12:41

As @jem suggested, it is possible to use jsoup.

With jsoup 1.8.3 it is possible to use the method Parser.unescapeEntities, which decodes the entities while leaving the rest of the original HTML intact.

import org.jsoup.parser.Parser;
...
// The second argument (inAttribute) is false because the string is not an attribute value.
String html = Parser.unescapeEntities(original_html, false);

It seems this method is not present in some earlier releases.

edited Mar 10 '17 at 12:41 by Evan Knowles
answered Sep 25 '15 at 14:27 by Daniele
