9

I have strings like:

Avery® Laser &amp; Inkjet Self-Adhesive

I need to convert them to

Avery Laser & Inkjet Self-Adhesive.

I.e. remove special characters and convert html special chars to regular ones.

Vladimir
  • I'm interested in why you are getting HTML-encoded strings at all... In my "ideal" app the programmer should never have to deal with them (you simply HTML-encode the result on output, but you never receive encoded input). – helios Feb 18 '10 at 09:24
  • It's legacy code that saves data in this raw format; I need to read and convert it. – Vladimir Feb 18 '10 at 09:41
  • 2
    Oh. As for the strange chars, it looks like the text was originally UTF-8 and was decoded (read) as ISO-8859-1 (Western ISO), for example. A Ñ takes 2 bytes in UTF-8, so if you read it as ISO-8859-1 it turns into strange chars. If that's the case and you know the encodings, you could use `new String(byte[], encodingName)` and `someString.getBytes(encodingName)` to obtain the good chars. – helios Feb 18 '10 at 10:01
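
The round trip described in the comment above can be sketched as follows. This is a minimal illustration, not code from the thread: the class and method names are hypothetical, and it assumes the text really was UTF-8 that got decoded as ISO-8859-1 (with another wrong charset, such as windows-1252, the bytes may not be recoverable this way).

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {

    // Hypothetical helper: undoes a UTF-8 -> ISO-8859-1 mis-decoding.
    static String repair(String garbled) {
        // Encoding with the wrongly used charset recovers the original bytes,
        // because ISO-8859-1 maps every byte value to exactly one char.
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes with the charset they were actually written in.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the damage: encode N-with-tilde as UTF-8 (bytes 0xC3 0x91),
        // then decode those bytes as ISO-8859-1.
        String original = "\u00D1";
        String garbled = new String(
                original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        System.out.println(garbled.length());                  // prints 2
        System.out.println(repair(garbled).equals(original));  // prints true
    }
}
```

Using the `StandardCharsets` constants instead of charset names avoids the checked `UnsupportedEncodingException` that the `String`-based overloads declare.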

4 Answers

20
Avery® Laser &amp; Inkjet Self-Adhesive

First use StringEscapeUtils#unescapeHtml4() (or #unescapeXml(), depending on the original format) to unescape the &amp; into a &. Then use String#replaceAll() with [^\x20-\x7e] to get rid of the characters that fall outside the printable ASCII range.

Summarized:

String clean = StringEscapeUtils.unescapeHtml4(dirty).replaceAll("[^\\x20-\\x7e]", "");

..which produces

Avery Laser & Inkjet Self-Adhesive

(without the trailing dot as in your example, but that wasn't present in the original ;) )
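
The second step can be tried on its own with nothing but the JDK. The sketch below is only an illustration: the class and method names are made up, and the input is assumed to have already been unescaped (for example by StringEscapeUtils), so it contains a literal & rather than &amp;.

```java
public class AsciiStrip {

    // Hypothetical helper: keeps only printable ASCII, i.e. the range
    // 0x20 (space) through 0x7E (tilde), and drops every other character.
    static String stripNonPrintableAscii(String s) {
        return s.replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        // The registered-trademark sign is written as a Unicode escape so the
        // example does not depend on the source file encoding.
        String unescaped = "Avery\u00AE Laser & Inkjet Self-Adhesive";

        System.out.println(stripNonPrintableAscii(unescaped));
        // prints: Avery Laser & Inkjet Self-Adhesive
    }
}
```

Note that this character class also removes tabs, line breaks, and every accented letter, so it only fits data that is expected to be plain ASCII.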

That said, this looks more like a request for a workaround than a request for a solution. If you elaborate on the functional requirement and/or on where this string originated, we may be able to provide the right solution. The ® looks like it was caused by reading the string with the wrong encoding, and the &amp; looks like it was caused by reading the string with a text-based parser instead of a full-fledged HTML parser.

lando
BalusC
  • Yep, the trailing dot is my typo. You're right that strings like this are the result of a text-based parser reading HTML. – Vladimir Feb 19 '10 at 11:45
6

You can use the StringEscapeUtils class from the Apache Commons Text project.

lando
Romain Linsolas
1

In case you want to mimic what the PHP function htmlspecialchars_decode does, use the PHP function get_html_translation_table() to dump the table and then use Java code like this:

    // Entities handled by PHP's htmlspecialchars_decode() with ENT_NOQUOTES.
    // The order matters: "&amp;" must come last, otherwise "&amp;lt;" would be
    // decoded twice and end up as "<" instead of "&lt;". An array keeps the
    // order fixed, which a Hashtable does not guarantee.
    static final String[][] html_specialchars_table = {
            {"&lt;", "<"},
            {"&gt;", ">"},
            {"&amp;", "&"},
    };

    static String htmlspecialchars_decode_ENT_NOQUOTES(String s) {
            for (String[] entry : html_specialchars_table) {
                    // replace() matches literally; replaceAll() would treat
                    // the key as a regular expression.
                    s = s.replace(entry[0], entry[1]);
            }
            return s;
    }
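
Replacing entities one after another is sensitive to order: if &amp; is handled before &lt;, the input &amp;lt; is decoded twice and comes out as < instead of &lt;. One way around this is to decode in a single pass over the string. The sketch below is an illustration under assumptions, not part of the original answer: the class name is hypothetical, it covers only the same three entities, and it needs Java 9 or later for `Map.of` and `Matcher.replaceAll(Function)`.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecialCharsDecoder {

    // The same three entities as in the table above.
    private static final Map<String, String> ENTITIES = Map.of(
            "&lt;", "<",
            "&gt;", ">",
            "&amp;", "&");

    // Matches exactly the entities that are keys of ENTITIES.
    private static final Pattern ENTITY = Pattern.compile("&(?:lt|gt|amp);");

    static String decode(String s) {
        // Each match is replaced once and the scan continues after it, so
        // already decoded text is never examined again.
        return ENTITY.matcher(s).replaceAll(
                match -> Matcher.quoteReplacement(ENTITIES.get(match.group())));
    }

    public static void main(String[] args) {
        System.out.println(decode("Laser &amp; Inkjet")); // prints: Laser & Inkjet
        System.out.println(decode("&amp;lt;b&gt;"));      // prints: &lt;b>
    }
}
```

`Matcher.quoteReplacement` is not strictly needed for these three values, but it keeps the code safe if the table later gains a replacement containing `$` or `\`.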
Bala Dutt
1

Maybe you can use something like:

yourTxt = yourTxt.replaceAll("&amp;", "&");

In one project I did something like this:

    public String replaceAcutesHTML(String str) {
        // replace() matches the entity literally; no regex is needed here.
        str = str.replace("&aacute;", "á");
        str = str.replace("&eacute;", "é");
        str = str.replace("&iacute;", "í");
        str = str.replace("&oacute;", "ó");
        str = str.replace("&uacute;", "ú");
        str = str.replace("&Aacute;", "Á");
        str = str.replace("&Eacute;", "É");
        str = str.replace("&Iacute;", "Í");
        str = str.replace("&Oacute;", "Ó");
        str = str.replace("&Uacute;", "Ú");
        str = str.replace("&ntilde;", "ñ");
        str = str.replace("&Ntilde;", "Ñ");
        return str;
    }

oropher
  • That means that you need to unescape every occurrence of every placeholder in HTML, which is a pain, especially when someone has already written it for you. – Chinmay Kanchi Feb 18 '10 at 15:12
  • That would work, but it's not an ideal approach. To do that you'd have to build (and maintain) a set of all special characters to replace. It's better to use an existing library or encoder than to do manual replacements where possible. It also happens to be easier and less tedious to implement! – Freiheit Feb 18 '10 at 15:12